Simon Haykin is University Professor and Director of the Adaptive Systems Laboratory at McMaster University. José C. Príncipe is Distinguished Professor of Electrical and Biomedical Engineering at the University of Florida, Gainesville, where he is BellSouth Professor and Founder and Director of the Computational NeuroEngineering Laboratory. Terrence J. Sejnowski is Francis Crick Professor, Director of the Computational Neurobiology Laboratory, and a Howard Hughes Medical Institute Investigator at the Salk Institute for Biological Studies and Professor of Biology at the University of California, San Diego. John McWhirter is Senior Fellow at QinetiQ Ltd., Malvern, Associate Professor at the Cardiff School of Engineering, and Honorary Visiting Professor at Queen’s University, Belfast.

OF RELATED INTEREST

Probabilistic Models of the Brain: Perception and Neural Function
edited by Rajesh P. N. Rao, Bruno A. Olshausen, and Michael S. Lewicki
The topics covered include Bayesian and information-theoretic models of perception, probabilistic theories of neural coding and spike timing, computational models of lateral and cortico-cortical feedback connections, and the development of receptive field properties from natural signals.

Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems
Peter Dayan and L. F. Abbott
Theoretical neuroscience provides a quantitative basis for describing what nervous systems do, determining how they function, and uncovering the general principles by which they operate. This text introduces the basic mathematical and computational methods of theoretical neuroscience and presents applications in a variety of areas, including vision, sensory-motor integration, development, learning, and memory.

The MIT Press
Massachusetts Institute of Technology
Cambridge, Massachusetts 02142
http://mitpress.mit.edu
ISBN 0-262-08348-5 / 978-0-262-08348-5

Neural Information Processing series

COMPUTER SCIENCE / COMPUTATIONAL NEUROSCIENCE / STATISTICS

New Directions in Statistical Signal Processing: From Systems to Brains
edited by Simon Haykin, José C. Príncipe, Terrence J. Sejnowski, and John McWhirter

Signal processing and neural computation have separately and significantly influenced many disciplines, but the cross-fertilization of the two fields has begun only recently. Research now shows that each has much to teach the other, as we see highly sophisticated kinds of signal processing and elaborate hierarchical levels of neural computation performed side by side in the brain. In New Directions in Statistical Signal Processing, leading researchers from both signal processing and neural computation present new work that aims to promote interaction between the two disciplines. The book's 14 chapters, almost evenly divided between signal processing and neural computation, begin with the brain and move on to communication, signal processing, and learning systems. They examine such topics as how computational models help us understand the brain's information processing, how an intelligent machine could solve the "cocktail party problem" with "active audition" in a noisy environment, graphical and network structure modeling approaches, uncertainty in network communications, the geometric approach to blind signal processing, game-theoretic learning algorithms, and observable operator models (OOMs) as an alternative to hidden Markov models (HMMs).

Neural Information Processing Series
Michael I. Jordan and Thomas Dietterich, editors

Advances in Large Margin Classifiers, Alexander J. Smola, Peter L. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, eds., 2000
Advanced Mean Field Methods: Theory and Practice, Manfred Opper and David Saad, eds., 2001
Probabilistic Models of the Brain: Perception and Neural Function, Rajesh P. N. Rao, Bruno A. Olshausen, and Michael S. Lewicki, eds., 2002
Exploratory Analysis and Data Modeling in Functional Neuroimaging, Friedrich T. Sommer and Andrzej Wichert, eds., 2003
Advances in Minimum Description Length: Theory and Applications, Peter D. Grünwald, In Jae Myung, and Mark A. Pitt, eds., 2005
Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, Gregory Shakhnarovich, Piotr Indyk, and Trevor Darrell, eds., 2006
New Directions in Statistical Signal Processing: From Systems to Brain, Simon Haykin, José C. Príncipe, Terrence J. Sejnowski, and John McWhirter, eds., 2007

New Directions in Statistical Signal Processing: From Systems to Brain

edited by
Simon Haykin
José C. Príncipe
Terrence J. Sejnowski
John McWhirter

The MIT Press Cambridge, Massachusetts London, England

© 2007 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Printed and bound in the United States of America

Library of Congress Cataloging-in-Publication Data

New directions in statistical signal processing: from systems to brain / edited by Simon Haykin ... [et al.].
p. cm. (Neural information processing series)
Includes bibliographical references and index.
ISBN-10: 0-262-08348-5 (alk. paper)
ISBN-13: 978-0-262-08348-5
1. Neural networks (Neurobiology) 2. Neural networks (Computer science) 3. Signal processing—Statistical methods. 4. Neural computers. I. Haykin, Simon S., 1931– II. Series.
QP363.3.N52 2006
612.8'2—dc22
2005056210

10 9 8 7 6 5 4 3 2 1

Contents

Series Foreword  vii
Preface  ix
1  Modeling the Mind: From Circuits to Systems  (Suzanna Becker)  1
2  Empirical Statistics and Stochastic Models for Visual Signals  (David Mumford)  23
3  The Machine Cocktail Party Problem  (Simon Haykin and Zhe Chen)  51
4  Sensor Adaptive Signal Processing of Biological Nanotubes (Ion Channels) at Macroscopic and Nano Scales  (Vikram Krishnamurthy)  77
5  Spin Diffusion: A New Perspective in Magnetic Resonance Imaging  (Timothy R. Field)  119
6  What Makes a Dynamical System Computationally Powerful?  (Robert Legenstein and Wolfgang Maass)  127
7  A Variational Principle for Graphical Models  (Martin J. Wainwright and Michael I. Jordan)  155
8  Modeling Large Dynamical Systems with Dynamical Consistent Neural Networks  (Hans-Georg Zimmermann, Ralph Grothmann, Anton Maximilian Schäfer, and Christoph Tietz)  203
9  Diversity in Communication: From Source Coding to Wireless Networks  (Suhas N. Diggavi)  243
10  Designing Patterns for Easy Recognition: Information Transmission with Low-Density Parity-Check Codes  (Frank R. Kschischang and Masoud Ardakani)  287
11  Turbo Processing  (Claude Berrou, Charlotte Langlais, and Fabrice Seguin)  307
12  Blind Signal Processing Based on Data Geometric Properties  (Konstantinos Diamantaras)  337
13  Game-Theoretic Learning  (Geoffrey J. Gordon)  379
14  Learning Observable Operator Models via the Efficient Sharpening Algorithm  (Herbert Jaeger, Mingjie Zhao, Klaus Kretzschmar, Tobias Oberstein, Dan Popovici, and Andreas Kolling)  417
References  465
Contributors  509
Index  513

Series Foreword

The yearly Neural Information Processing Systems (NIPS) workshops bring together scientists with broadly varying backgrounds in statistics, mathematics, computer science, physics, electrical engineering, neuroscience, and cognitive science, unified by a common desire to develop novel computational and statistical strategies for information processing, and to understand the mechanisms for information processing in the brain. As opposed to conferences, these workshops maintain a flexible format that both allows and encourages the presentation and discussion of work in progress, and thus serves as an incubator for the development of important new ideas in this rapidly evolving field.

The series editors, in consultation with workshop organizers and members of the NIPS Foundation Board, select specific workshop topics on the basis of scientific excellence, intellectual breadth, and technical impact. Collections of papers chosen and edited by the organizers of specific workshops are built around pedagogical introductory chapters, while research monographs provide comprehensive descriptions of workshop-related topics, to create a series of books that provides a timely, authoritative account of the latest developments in the exciting field of neural computation.

Michael I. Jordan
Thomas Dietterich

Preface

In the course of some 60 to 65 years, going back to the 1940s, signal processing and neural computation have evolved into two highly pervasive disciplines. In their own individual ways, they have significantly influenced many other disciplines. What is perhaps surprising, however, is that the cross-fertilization between signal processing and neural computation is still very much in its infancy. We only need to look at the brain and be amazed by the highly sophisticated kinds of signal processing and elaborate hierarchical levels of neural computation that are performed side by side and with relative ease. If there is one important lesson that the brain teaches us, it is summed up here: there is much that signal processing can learn from neural computation, and vice versa.

It is with this aim in mind that in October 2003 we organized a one-week workshop on "Statistical Signal Processing: New Directions in the Twentieth Century," held at the Fairmont Lake Louise Hotel, Lake Louise, Alberta. To fulfill that aim, we invited leading researchers from around the world in the two disciplines, signal processing and neural computation, in order to encourage interaction and cross-fertilization between them. Needless to say, the workshop was highly successful.

One of the most satisfying outcomes of the Lake Louise Workshop is that it led to the writing of this new book. The book consists of 14 chapters, divided almost equally between signal processing and neural computation. To emphasize, in some sense, the spirit of the above-mentioned lesson, the book is entitled New Directions in Statistical Signal Processing: From Systems to Brain. It is our sincere hope that, in some measurable way, the book will prove helpful in realizing the original aim that we set out for the Lake Louise Workshop.

Finally, we wish to thank Dr. Zhe Chen, who devoted tremendous effort and time to LaTeX editing and proofreading during the preparation and final production of the book.

Simon Haykin
José C. Príncipe
Terrence J. Sejnowski
John McWhirter

1 Modeling the Mind: From Circuits to Systems

Suzanna Becker

Computational models are having an increasing impact on neuroscience, by shedding light on the neuronal mechanisms underlying information processing in the brain. In this chapter, we review the contribution of computational models to our understanding of how the brain represents and processes information at three broad levels: (1) sensory coding and perceptual processing, (2) high-level memory systems, and (3) representations that guide actions. So far, computational models have had the greatest impact at the earliest stages of information processing, by modeling the brain as a communication channel and applying concepts from information theory. Generally, these models assume that the goal of sensory coding is to map the high-dimensional sensory signal into a (usually lower-dimensional) code that is optimal with respect to some measure of information transmission. Four information-theoretic coding principles will be considered here, each of which can be used to derive unsupervised learning rules, and each of which has been applied to model multiple levels of cortical organization. Moving beyond perceptual processing to high-level memory processes, the hippocampal system in the medial temporal lobe (MTL) is a key structure for representing complex configurations or episodes in long-term memory. In the hippocampal region, the brain may use very different optimization principles, aimed at the memorization of complex events or spatiotemporal episodes and the subsequent reconstruction of details of these episodic memories. Here, rather than recoding the incoming signals in a way that abstracts away unnecessary details, the goal is to memorize the incoming signal as accurately as possible in a single learning trial. Most efforts at understanding hippocampal function through computational modeling have focused on subregions within the hippocampal circuit, such as the CA3 or CA1 regions, using "off-the-shelf" learning algorithms such as competitive learning or Hebbian pattern association.
More recently, Becker proposed a global optimization principle for learning within this brain region. Based on the goal of accurate input reconstruction, combined with neuroanatomical constraints, this leads to simple, biologically plausible learning rules for all regions within the hippocampal circuit. The model exhibits the key features of an episodic memory system: the capacity to store a large number of distinct, complex episodes, to recall a complete episode from a minimal cue, and to associate items across time, under extremely high plasticity conditions. Finally, moving beyond the static representation of information, we must consider the brain not simply as a passive recipient of information, but as a complex, dynamical system, with internal goals and the ability to select actions based on environmental feedback. Ultimately, models based on the broad goals of prediction and control, using reinforcement-driven learning algorithms, may be the best candidates for characterizing the representations that guide motor actions. Several examples of models are described that begin to address the problem of how we learn representations that can guide our actions in a complex environment.

1.1 Introduction

How does the brain process, represent, and act on sensory signals? Through the use of computational models, we are beginning to understand how neural circuits perform these remarkably complex information-processing tasks. Psychological and neurobiological studies have identified at least three distinct long-term memory systems in the brain: (1) the perceptual/semantic memory system in the neocortex learns gradually to represent the salient features of the environment; (2) the episodic memory system in the medial temporal lobe learns rapidly to encode complex events, rich in detail, characterizing a particular episode in a particular place and time; and (3) the procedural memory system, encompassing numerous cortical and subcortical structures, learns sensory-motor mappings. In this chapter, we consider several major developments in computational modeling that shed light on how the brain learns to represent information at three broad levels, reflecting these three forms of memory: (1) sensory coding, (2) episodic memory, and (3) representations that guide actions. Rather than providing a comprehensive review of all models in these areas, our goal is to highlight some of the key developments in the field, and to point to the most promising directions for future work.

1.2 Sensory Coding

At the earliest stages of sensory processing in the cortex, quite a lot is known about the neural coding of information, from Hubel and Wiesel's classic findings of orientation-selective neurons in primary visual cortex (Hubel and Wiesel, 1968) to more recent studies of spatiotemporal receptive fields in visual cortex (DeAngelis et al., 1993) and spectrotemporal receptive fields in auditory cortex (Calhoun and Schreiner, 1998; Kowalski et al., 1996). Given the abundance of electrophysiological data to constrain the development of computational models, it is not surprising that most models of learning and memory have focused on the early stages of sensory coding. One approach to modeling sensory coding is to hand-design filters, such as the Gabor or difference-of-Gaussians filter, so as to match experimentally observed receptive fields. However, this approach has limited applicability beyond the very earliest stages of sensory processing, for which receptive fields have been reasonably well mapped out. A more promising approach is to try to understand the developmental processes that generated the observed data. Note that these could include both learning and evolutionary factors, but here our focus is restricted to potential learning mechanisms. The goal is then to discover the general underlying principles that cause sensory systems to self-organize their receptive fields. Once these principles have been uncovered, they can be used to derive models of learning. One can then simulate the developmental process by exposing the model to typical sensory input and comparing the results to experimental observations. More important, one can simulate neuronal functions that might not have been conceived by experimentalists, and thereby generate novel experimental predictions.

Several classes of computational models have been influential in guiding current thinking about self-organization in sensory systems. These models share the general feature of modeling the brain as a communication channel and applying concepts from information theory. The underlying assumption of these models is that the goal of sensory coding is to map the high-dimensional sensory signal into another (usually lower-dimensional) code that is somehow optimal with respect to information content. Four information-theoretic coding principles will be considered here: (1) Linsker's Infomax principle, (2) Barlow's redundancy reduction principle, (3) Becker and Hinton's Imax principle, and (4) Rissanen's minimum description length (MDL) principle.
Each of these principles has been used to derive models of learning and has inspired further research into related models at multiple stages of information processing.

1.2.1 Linsker's Infomax Principle

How should neurons respond to the sensory signal, given that it is noisy, high-dimensional, and highly redundant? Is there a more convenient form in which to encode signals so that we can make more sense of the relevant information and take appropriate actions? In the human visual system, for example, there are hundreds of millions of photoreceptors converging onto about two million optic nerve fibers. By what principle does the brain decide what information to discard and what to preserve? Linsker proposed a model of self-organization in sensory systems based on the Infomax principle: each neuron adjusts its connection strengths or weights so as to maximize the amount of Shannon information in the neural code that is conveyed about the sensory input (Linsker, 1988). In other words, the Infomax principle dictates that neurons should maximize the mutual information between their input x and output y:

$$I_{x;y} = \left\langle \ln \frac{p(x \mid y)}{p(x)} \right\rangle$$

Figure 1.1 Linsker's multilayer architecture for learning center-surround and oriented receptive fields from environmental input. The inputs consisted of uncorrelated noise, and in each layer center-surround receptive fields evolved with progressively greater contrast between center and surround; higher layers learned progressively more "Mexican-hat-like" receptive fields.

Assuming that the input consists of a multidimensional Gaussian signal with additive, independent Gaussian noise of variance V(n), for a single neuron whose output y is a linear function of its inputs and connection weights w, the mutual information is the log of the signal-to-noise ratio:

$$I_{x;y} = \frac{1}{2} \ln \frac{V(y)}{V(n)}$$
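To make the linear Gaussian case concrete, the sketch below (a hypothetical numerical illustration, not code from the chapter; the covariance, noise level, and function names are invented) evaluates this information measure for a linear unit and shows that a normalized Hebbian-style update, which rotates the weight vector toward the principal input direction, drives the measure upward:

```python
import numpy as np

# Toy input ensemble: zero-mean Gaussian with most variance along the first axis.
C = np.array([[4.0, 0.0],
              [0.0, 1.0]])   # input covariance (assumed for illustration)
V_n = 0.5                    # variance of the additive output noise

def infomax_nats(w):
    """I_{x;y} = (1/2) ln(V(y)/V(n)) for a linear unit y = w.x + noise."""
    V_y = w @ C @ w + V_n    # output variance = projected signal variance + noise
    return 0.5 * np.log(V_y / V_n)

# The expected Hebbian update <x y> is proportional to C w, which rotates w
# toward the leading eigenvector of C, the direction of maximal signal variance.
w = np.array([0.6, 0.8])
history = [infomax_nats(w)]
for _ in range(50):
    w = C @ w
    w /= np.linalg.norm(w)   # keep |w| = 1 so the gain comes from rotation alone
    history.append(infomax_nats(w))
```

Under these assumptions w converges to the leading eigenvector [1, 0], which yields the largest mutual information among unit-norm weight vectors.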

Linsker showed that a simple, Hebb-like weight update rule approximately maximizes this information measure. The center-surround receptive field (with either an on-center and off-surround or off-center and on-surround spatial pattern of connection strengths) is characteristic of neurons in the earliest stages of the visual pathways, including the retina and lateral geniculate nucleus (LGN) of the thalamus. Surprisingly, Linsker's simulations using purely uncorrelated random inputs, and a multilayer circuit as shown in figure 1.1, showed that neurons in successive layers developed progressively more "Mexican-hat" shaped receptive fields (Linsker, 1986a,b,c), reminiscent of the center-surround receptive fields seen in the visual system. In further developments of the model, using a two-dimensional sheet of neurons with local-neighbor lateral connections, Linsker (1989) showed that the model self-organized topographic maps with oriented receptive fields, such that nearby units on the map developed similarly oriented receptive fields. This organization is a good first approximation to that of the primary visual cortex.

The Infomax principle has been highly influential in the study of neural coding, going well beyond Linsker's pioneering work in the linear case. One of the major developments in this field is Bell and Sejnowski's Infomax-based independent components analysis (ICA) algorithm, which applies to nonlinear mappings with equal numbers of inputs and outputs (Bell and Sejnowski, 1995). Bell and Sejnowski showed that when the mapping from inputs to outputs is continuous, nonlinear, and invertible, maximizing the mutual information between inputs and outputs is equivalent to simply maximizing the entropy of the output signal. The algorithm therefore performs a form of ICA. Infomax-based ICA has also been used to model receptive fields in visual cortex. When applied to natural images, in contrast to principal component analysis (PCA), Infomax-based ICA develops oriented receptive fields at a variety of spatial scales that are sparse, spatially localized, and reminiscent of oriented receptive fields in primary visual cortex (Bell and Sejnowski, 1997). Another variant of nonlinear Infomax, developed by Okajima and colleagues (Okajima, 2004), has also been applied to modeling higher levels of visual processing, including combined binocular disparity and spatial frequency analysis.
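As a concrete illustration of this entropy-maximization route to ICA, the following sketch (my own minimal example, not the authors' code; the mixing matrix, learning rate, and iteration count are invented) unmixes two Laplacian sources with a natural-gradient Infomax update, using the tanh score function appropriate for super-Gaussian sources:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two independent super-Gaussian (Laplacian) sources, linearly mixed.
S = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])          # hypothetical mixing matrix
X = A @ S

# Natural-gradient Infomax: dW = (I - <tanh(u) u^T>) W.  At the fixed point
# the recovered components u = W x are statistically independent.
W = np.eye(2)
lr = 0.02
for _ in range(500):
    U = W @ X                       # current source estimates
    grad = (np.eye(2) - np.tanh(U) @ U.T / n) @ W
    W += lr * grad

U = W @ X
# Each recovered component should track exactly one true source
# (up to scale and permutation, the usual ICA ambiguities).
corr = np.abs(np.corrcoef(np.vstack([U, S]))[:2, 2:])
```

The absolute correlation matrix between recovered and true sources should be close to a permutation matrix, reflecting ICA's inherent permutation and scaling ambiguity.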

1.2.2 Barlow's Redundancy Reduction Principle

The principle of preserving information may be a good description of the very earliest stages of sensory coding, but it is unlikely that this one principle will capture all levels of processing in the brain. Clearly, one can trivially preserve all the information in the input simply by copying the input to the next level up. Thus, the idea only makes sense in the context of additional processing constraints. Implicit in Linsker's work was the constraint of dimension reduction. However, in the neocortex, there is no evidence of a progressive reduction in the number of neurons at successively higher levels of processing. Barlow proposed a slightly different principle of self-organization, based on the idea of producing a minimally redundant code. The information about an underlying signal of interest (such as the visual form or the sound of a predator) may be distributed across many input channels. This makes it difficult to associate particular stimulus values with distinct responses. Moreover, there is a high degree of redundancy across different channels. Thus, a neural code having minimal redundancy should make it easier to associate different stimulus values with different responses. The formal, information-theoretic definition of redundancy is the information content of the stimulus, less the capacity of the channel used to convey the information. Unfortunately, quantities dependent upon the calculation of entropy are difficult to compute. Thus, several different formulations of Barlow's principle have been proposed, under varying assumptions and approximations. One simple way for a learning algorithm to lower redundancy is to reduce correlations among the outputs (Barlow and Földiák, 1989). This can remove second-order but not higher-order dependencies. Atick and Redlich proposed minimizing the following measure of redundancy (Atick and Redlich, 1990):

$$R = 1 - \frac{I_{y;s}}{C_{\text{out}}(y)}$$
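A minimal numerical illustration of second-order redundancy reduction (my own example, using an explicit whitening transform rather than any learning rule from the chapter): a linear transform built from the sample covariance removes all pairwise correlations among the outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# A common signal drives both channels, so the raw inputs are highly redundant.
s = rng.normal(size=n)
X = np.vstack([s + 0.3 * rng.normal(size=n),
               s + 0.3 * rng.normal(size=n)])

C = np.cov(X)                       # sample covariance of the inputs
corr_before = abs(C[0, 1]) / np.sqrt(C[0, 0] * C[1, 1])

# Symmetric (ZCA) whitening: W = C^{-1/2}, so cov(W X) = I by construction.
evals, evecs = np.linalg.eigh(C)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
Y = W @ X
C_after = np.cov(Y)
```

After the transform the output covariance is the identity, so the second-order (correlational) redundancy is gone; any remaining higher-order dependencies would require nonlinear methods such as ICA.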

Figure 1.2 Atick and Redlich's learning principle: minimize redundancy in the output y while preserving information about the input. The signal s enters through a noisy input channel, x = s + v1, and is then recoded as y = Ax + v2.

subject to the constraint of zero information loss (fixed I_{y;s}). C_out(y), the output channel capacity, is defined to be the maximum of I_{y;s}. The channel capacity is at a maximum when the covariance matrix of the output elements is diagonal, hence Atick and Redlich used

$$C_{\text{out}}(y) = \frac{1}{2} \sum_i \ln \frac{(R_{yy})_{ii}}{N_v^2}$$

Thus, under this formulation, minimizing redundancy amounts to minimizing the channel capacity. This model, depicted in figure 1.2, was used to simulate retinal receptive fields. Under conditions of high noise (low redundancy), the receptive fields that emerged were Gaussian-shaped spatial smoothing filters, while at low noise levels (high redundancy) on-center off-surround receptive fields resembling second spatial derivative filters emerged. In fact, cells in the mammalian retina and lateral geniculate nucleus of the thalamus dynamically adjust their filtering characteristics as light levels fluctuate between these two extremes under conditions of low versus high contrast (Shapley and Victor, 1979; Virsu et al., 1977). Moreover, this strategy of adaptive rescaling of neural responses has been shown to be optimal with respect to information transmission (Brenner et al., 2000). Similar learning principles have been applied by Atick and colleagues to model higher stages of visual processing. Dong and Atick modeled redundancy reduction across time, in a model of visual neurons in the lateral geniculate nucleus of the thalamus (Dong and Atick, 1995). In their model, neurons with both lagged and nonlagged spatiotemporal smoothing filters emerged. These receptive fields would be useful for conveying information about stimulus onsets and offsets. Li and Atick (1994) modeled redundancy reduction across binocular visual inputs. Their model generated binocular, oriented receptive fields at a variety of spatial scales, similar to those seen in primary visual cortex.
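The noise-dependent switch between smoothing and whitening can be sketched in one dimension (a stylized illustration I constructed, not the Atick-Redlich derivation itself; the 1/f² spectrum and the particular filter factorization are assumptions): an optimal-coding-style filter combines a whitening term with Wiener-style noise suppression, and its peak moves from high to low spatial frequency as noise grows.

```python
import numpy as np

freqs = np.linspace(0.1, 10.0, 1000)     # spatial frequency axis
S = 1.0 / freqs**2                       # assumed 1/f^2 natural-image spectrum

def retinal_filter(noise_power):
    """Whitening (S^-1/2) attenuated by a Wiener-style factor S/(S + N)."""
    return np.sqrt(1.0 / S) * S / (S + noise_power)

low_noise = retinal_filter(0.01)    # ~whitening: band-pass, derivative-like
high_noise = retinal_filter(1.0)    # ~smoothing: emphasis on low frequencies

peak_low = freqs[np.argmax(low_noise)]
peak_high = freqs[np.argmax(high_noise)]
```

With these assumptions the filter reduces to f/(1 + N f²), whose peak sits at 1/sqrt(N): low noise pushes the passband toward high frequencies (a whitening, center-surround-like filter), high noise pulls it toward low frequencies (a smoothing filter), mirroring the two regimes described above.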
Bell and Sejnowski's Infomax-based ICA algorithm (Bell and Sejnowski, 1995) is also closely related to Barlow's minimal redundancy principle: since the ICA model is restricted to invertible mappings, the maximization of mutual information amounts to reducing statistical dependencies among the outputs.

Figure 1.3 Becker and Hinton's Imax learning principle maximizes the mutual information I(a; b) between features a = f1(s + n1) and b = f2(s + n2) extracted from different input channels.

1.2.3 Becker and Hinton's Imax Principle

The goal of retaining as much information as possible may be a good description of early sensory coding. However, the brain seems to do much more than simply preserve information and recode it into a more convenient form. Our perceptual systems are exquisitely tuned to certain regularities in the world, and consequently to irregularities which violate our expectations. The things which capture our attention, and thus motivate us to learn and act, are those which violate our expectations about the coherence of the world—the sudden onset of a sound, the appearance of a looming object or a predator. In order to be sensitive to changes in our environment, we require internal representations which first capture the regularities in our environment. Even relatively low-order regularities, such as the spatial and temporal coherence of sensory signals, convey important cues for extracting very high level properties of objects. For example, the coherence of the visual signal across time and space allows us to segregate the parts of a moving object from its surrounding background, while the coherence of auditory events across frequency and time permits the segregation of the auditory input into its multiple distinct sources.

Becker and Hinton (1992) proposed the Imax principle for unsupervised learning, which dictates that signals of interest should have high mutual information across different sensory channels. In the simplest case, illustrated in figure 1.3, there are two input sources, x1 and x2, conveying information about a common underlying Gaussian signal of interest, s, and each channel is corrupted by independent, additive Gaussian noise: x1 = s + n1, x2 = s + n2. However, the input may be high-dimensional and may require a nonlinear transformation in order to extract the signal. Thus the goal of the learning is to transform the two input signals into outputs, y1 and y2, having maximal mutual information. Because the signal is Gaussian and the noise terms are assumed to be identically distributed, the information in common to the two outputs can be maximized by maximizing the following log signal-to-noise ratio (SNR):

$$I_{y_1;y_2} \approx \log \frac{V(y_1 + y_2)}{V(y_1 - y_2)}$$

This could be accomplished by multiple stages of processing in a nonlinear neural circuit like the one shown in figure 1.4.

Figure 1.4 Imax architecture used to learn stereo features from binary images; the mutual information I(a; b) is maximized between features a and b extracted from the left and right strips.

Becker and Hinton (1992) showed that this model could extract binocular disparity from random dot stereograms, using the architecture shown in figure 1.4. Note that this function requires multiple stages of processing through a network of nonlinear neurons with sigmoidal activation functions. The Imax algorithm has been used to learn temporally coherent features (Becker, 1996; Stone, 1996), and extended to learn multidimensional features (Zemel and Hinton, 1991). A very similar algorithm for binary units was developed by Kay and colleagues (Kay, 1992; Phillips et al., 1998). The minimizing disagreement algorithm (de Sa, 1994) is a probabilistic learning procedure based on principles of Bayesian classification, but is nonetheless very similar to Imax in its objective of extracting classes that are coherent across multiple sensory input channels.
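The SNR objective above can be maximized directly. The following sketch (my own toy construction; the channel dimensionality, noise level, and optimization scheme are invented for the demo) performs finite-difference gradient ascent on log V(y1+y2)/V(y1−y2) for two linear units viewing noisy copies of a shared signal; the learned weight vectors rotate toward the common signal direction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10000
v = np.array([1.0, 1.0]) / np.sqrt(2)    # direction of the shared signal (assumed)

s = rng.normal(size=n)                    # common underlying Gaussian signal
X1 = np.outer(v, s) + 0.7 * rng.normal(size=(2, n))   # x1 = s*v + n1
X2 = np.outer(v, s) + 0.7 * rng.normal(size=(2, n))   # x2 = s*v + n2

def imax_objective(w):
    """log V(y1 + y2) / V(y1 - y2) for linear outputs y1 = w1.x1, y2 = w2.x2."""
    w1, w2 = w[:2], w[2:]
    y1, y2 = w1 @ X1, w2 @ X2
    return np.log(np.var(y1 + y2) / np.var(y1 - y2))

# Finite-difference gradient ascent, renormalizing each weight vector so the
# improvement comes from rotating toward the signal, not from rescaling.
w = np.array([1.0, 0.0, 1.0, 0.0])
eps, lr = 1e-4, 0.1
start = imax_objective(w)
for _ in range(200):
    g = np.zeros(4)
    for j in range(4):
        d = np.zeros(4); d[j] = eps
        g[j] = (imax_objective(w + d) - imax_objective(w - d)) / (2 * eps)
    w += lr * g
    w[:2] /= np.linalg.norm(w[:2])
    w[2:] /= np.linalg.norm(w[2:])
end = imax_objective(w)
```

In this linear Gaussian setting the objective is maximized when both weight vectors align with the shared signal direction v, so the common signal passes into y1 + y2 while the independent noise dominates y1 − y2.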

1.2.4 Rissanen's Minimum Description Length Principle

The overall goal of every unsupervised learning algorithm is to discover the important underlying structure in the data. Learning algorithms based on Shannon information have the drawback of requiring knowledge of the probability distribution of the data, and/or of the extracted features, and hence tend either to be very computationally expensive or to make highly simplifying assumptions about the distributions (e.g., binary or Gaussian variables). An alternative approach is to develop a model of the data that is somehow optimal with respect to coding efficiency. The minimum description length (MDL) principle, first introduced by Rissanen (1978), favors models that provide accurate encoding of the data using as simple a model as possible. The rationale behind the MDL principle is that the criterion of discovering statistical regularities in data can be quantified by the length of the code generated to describe the data.

Figure 1.5 Minimum description length (MDL) principle: data d are encoded by a model M into a code c, and decoded by the inverse model M^{-1} into reconstructed data d'.

A large number of learning algorithms have been developed based on the MDL principle, but only a few of these have attempted to provide plausible accounts of neural processing. One such example was developed by Zemel and Hinton (1995), who cast the autoencoder problem within an MDL framework. They proposed that the goal of learning should be to minimize the total cost of communicating the input data, which depends on three terms: the length of the code c, the cost of communicating the model (that is, the cost of communicating how to reconstruct the data from the code, M^{-1}), and the reconstruction error:

$$\mathrm{Cost} = \mathrm{Length}(c) + \mathrm{Length}(M^{-1}) + \mathrm{Length}(|d - d'|)$$

as illustrated in figure 1.5. They instantiated these ideas using an autoencoder architecture with hidden units whose activations were Gaussian functions of the inputs. Under a Gaussian model of the input activations, it was assumed that the hidden unit activations, as a population, encode a point in a lower-dimensional implicit representational space. For example, a population of place cells in the hippocampus might receive very high dimensional multisensory input, and map this input onto a population of neural activations which codes implicitly the animal's spatial location—a point in a two-dimensional Cartesian space. The population response could be decoded by averaging together the implicit coordinates of the hidden units, weighted by their activations. Zemel and Hinton's cost function incorporated a reconstruction term and a coding cost term that measured the fit of the hidden unit activations to a Gaussian model of implicit coordinates.
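The flavor of this cost tradeoff can be illustrated with a crude two-part code for polynomial regression (my own example, not anything from Zemel and Hinton; the per-parameter cost of (1/2) log n is the standard asymptotic approximation): richer models shrink the reconstruction term but pay more to describe themselves.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = np.linspace(-1, 1, n)
y = 2.0 - 3.0 * x + 1.5 * x**2 + 0.3 * rng.normal(size=n)   # truly quadratic data

def description_length(degree):
    """Two-part code in nats: data cost under Gaussian residuals + model cost."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    rss = resid @ resid
    data_cost = 0.5 * n * np.log(rss / n)          # ~ negative log-likelihood
    model_cost = 0.5 * (degree + 1) * np.log(n)    # ~ (1/2) log n per parameter
    return data_cost + model_cost

dl = {d: description_length(d) for d in range(1, 9)}
```

An underfit line pays heavily in reconstruction error, while an overfit degree-8 polynomial pays in model cost, so the quadratic model should attain (or very nearly attain) the minimum total description length.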
The weights of the hidden units and the coordinates in implicit space were jointly optimized with respect to this MDL cost. Algorithms which perform clustering, when cast within a statistical framework, can also be viewed as a form of MDL learning. Nowlan derived such an algorithm, called maximum likelihood competitive learning (MLCL), for training neural networks using the expectation maximization (EM) algorithm (Jacobs et al., 1991; Nowlan, 1990). In this framework, the network is viewed as a probabilistic, generative model of the data. The learning serves to adjust the weights so as to maximize the log likelihood of the model having generated the data: L = log P(data | model). If the training patterns, I^{(α)}, are independent,

L = log ∏_{α=1}^{n} P(I^{(α)} | model) = ∑_{α=1}^{n} log P(I^{(α)} | model).

The MLCL algorithm applies this objective function to the case where the units have Gaussian activations and form a mixture model of the data:

L = ∑_{α=1}^{n} log ∑_{i=1}^{m} P(I^{(α)} | submodel_i) P(submodel_i) = ∑_{α=1}^{n} log ∑_{i=1}^{m} y_i^{(α)} π_i,

where the π_i's are positive mixing coefficients that sum to one, and the y_i's are the unit activations:

y_i^{(α)} = N(I^{(α)}; w_i, Σ_i),

where N( ) is the Gaussian density function, with mean w_i and covariance matrix Σ_i. The MLCL model makes the assumption that every pattern is independent of every other pattern. However, this assumption of independence is not valid under natural viewing conditions. If one view of an object is encountered, a similar view of the same object is likely to be encountered next. Hence, one powerful cue for real vision systems is the temporal continuity of objects. Novel objects typically are encountered from a variety of angles, as the position and orientation of the observer, or objects, or both, vary smoothly over time. Given the importance of temporal context as a cue for feature grouping and invariant object recognition, it is very likely that the brain makes use of this property of the world in perceptual learning. Becker (1999) proposed an extension to MLCL that incorporates context into the learning. Relaxing the assumption that the patterns are independent, allowing for temporal dependencies among the input patterns, the log likelihood function becomes:

L = log P(data | model) = ∑_α log P(I^{(α)} | I^{(1)}, . . . , I^{(α−1)}, model).
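The MLCL mixture log likelihood can be evaluated directly. A minimal numpy sketch, assuming spherical covariances (Σ_i = σ_i² I) for simplicity; the data and model parameters below are invented:

```python
import numpy as np

def mlcl_log_likelihood(X, means, sigmas, pis):
    """MLCL mixture log likelihood:
    L = sum_alpha log sum_i pi_i * y_i^(alpha),
    with y_i the Gaussian density N(I; w_i, sigma_i^2 I)."""
    X, means = np.atleast_2d(X), np.atleast_2d(means)
    n, d = X.shape
    L = 0.0
    for x in X:
        sq = np.sum((x - means) ** 2, axis=1)          # ||x - w_i||^2 per unit
        y = np.exp(-0.5 * sq / sigmas**2) \
            / (2 * np.pi * sigmas**2) ** (d / 2)       # Gaussian densities y_i
        L += np.log(np.dot(pis, y))                    # log sum_i pi_i y_i
    return L

# One 1-D pattern under a single standard Gaussian submodel:
L = mlcl_log_likelihood([[0.0]], [[0.0]], np.array([1.0]), np.array([1.0]))
assert abs(L + 0.5 * np.log(2 * np.pi)) < 1e-12
```

EM would alternate between computing each submodel's responsibility for each pattern and re-estimating the means and mixing coefficients; only the objective is shown here.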


Figure 1.6 Contextually modulated competitive learning. A secondary input source (context units c_i) drives gating units g_j, whose modulatory signal gates the clustering units y_k; the input units I_m supply the primary data stream to the clustering units.

To incorporate a contextual information source into the learning equation, a contextual input stream was introduced into the likelihood function:

L = log P(data | model, context) = ∑_α log P(I^{(α)} | I^{(1)}, . . . , I^{(α−1)}, model, context),

as depicted in fig 1.6. This model was trained on a series of continuously rotating images of faces, and learned a representation that categorized people's faces according to identity, independent of viewpoint, by taking advantage of the temporal continuity in the image sequences. Many models of population encoding apply to relatively simple, one-layer feedforward architectures. However, the structure of neocortex is much more complex. There are multiple cortical regions, and extensive feedback connections both within and between regions. Taking these features of neocortex into account, Hinton has developed a series of models based on the Boltzmann machine (Ackley et al., 1985), and the more recent Helmholtz machine (Dayan et al., 1995) and Product of Experts (PoE) model (Hinton, 2000; Hinton and Brown, 2000). The common idea underlying these models is to try to find a population code that forms a causal model of the underlying data. The Boltzmann machine was unacceptably slow at sampling the "unclamped" probability distribution of the unit states. The Helmholtz machine and PoE model overcome this limitation by using more restricted architectures and/or approximate methods for sampling the probability distributions over units' states (see fig 1.7A). In both cases, the bottom-up weights embody a "recognition model"; that is, they are used to produce the most probable set of hidden states given the data. At the same time, the top-down weights constitute a "generative model"; that is, they produce a set of hidden states most likely to have generated the data. The "wake-sleep algorithm" maximizes the log likelihood

Figure 1.7 Hinton's Product of Experts model, showing (A) the basic architecture, with data units d_1, d_2, d_3 connected to experts p_1, p_2 by inference (bottom-up) and generative (top-down) weights, and (B) brief Gibbs sampling, which involves several alternating iterations of clamping the input units to sample from the hidden unit states, and then clamping the hidden units to sample from the input unit states. This procedure samples the "unclamped" distribution of states in a local region around each data vector and tries to minimize the difference between the clamped and unclamped distributions.

of the data under this model and results in a simple equation for updating either set of weights:

Δw_kj = ε s_k^α (s_j^α − p_j^α),

where p_j^α is the target state for unit j on pattern α, and s_j^α is the corresponding network state, a stochastic sample based on the logistic function of the unit's net input. Target states for the generative weight updates are derived from top-down expectations based on samples using the recognition model, whereas for the recognition weights, the targets are derived by making bottom-up predictions based on samples from the generative model. The Products of Experts model advances on this learning procedure by providing a very efficient procedure called "brief Gibbs sampling" for estimating the most probable states to have generated the data, as illustrated in fig 1.7B.
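The wake-sleep delta rule above is simple enough to state in a few lines. The shapes, random seed, and layer sizes below are invented for illustration; this is a sketch of the update, not Hinton's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def wake_sleep_delta(s_k, s_j, p_j, eps=0.1):
    """Wake-sleep weight update quoted above:
    delta_w_kj = eps * s_k * (s_j - p_j),
    with p_j the target state for unit j and s_j its stochastic sample."""
    return eps * np.outer(s_k, s_j - p_j)

# One illustrative step: sample binary states from logistic probabilities.
net_input = rng.normal(size=3)
p = logistic(net_input)                  # target probabilities p_j
s = (rng.random(3) < p).astype(float)    # stochastic binary states s_j
pre = np.array([1.0, 0.0])               # presynaptic states s_k
dW = wake_sleep_delta(pre, s, p)
assert dW.shape == (2, 3)
assert np.allclose(dW[1], 0.0)           # silent presynaptic unit: no change
```

Whether the targets come from the generative or the recognition model depends on whether the network is in the "wake" or "sleep" phase, as described in the text.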

1.3

Models of Episodic Memory

Moving beyond sensory coding to high-level memory systems in the medial temporal lobe (MTL), the brain may use very different optimization principles aimed at the memorization of complex events or spatiotemporal episodes, and at subsequent reconstruction of details of these episodic memories. Here, rather than recoding the incoming signals in a way that abstracts away unnecessary details, the goal is to memorize the incoming signal as accurately as possible in a single learning trial. The hippocampus is a key structure in the MTL that appears to be crucial for episodic memory. It receives input from most cortical regions, and is at the point of convergence between the ventral and dorsal visual pathways, as illustrated in fig 1.8 (adapted from (Mishkin et al., 1997)). Some of the unique anatomical and physiological characteristics of the hippocampus include the following: (1) the very large expansion of dimensionality from the entorhinal cortex (EC) to the dentate gyrus (DG) (the principal cells in the dentate gyrus outnumber those of the EC by about a factor of 5 in the rat (Amaral et al., 1990)); (2) the large and potent mossy fiber synapses projecting from the dentate gyrus to CA3, which are the largest synapses in the brain and have been referred to as "detonator synapses" (McNaughton and Morris, 1987); and (3) the extensive set of recurrent collateral connections within the CA3 region. In addition, the hippocampus exhibits unique physiological properties including (1) extremely sparse activations (low levels of activity), particularly in the dentate gyrus where firing rates of granule cells are about 0.5 Hz (Barnes et al., 1990; Jung and McNaughton, 1993), and (2) the constant replacement of neurons (neurogenesis) in the dentate gyrus: about 1% of the neurons in the dentate gyrus are replaced each day in young adult rats (Martin Wojtowicz, University of Toronto, unpublished data).
In 1971 Marr put forward a highly inﬂuential theory of hippocampal coding (Marr, 1971). Central to Marr’s theory were the notions of a rapid, temporary memory store mediated by sparse activations and Hebbian learning, an associative retrieval system mediated by recurrent connections, and a gradual consolidation process by which new memories would be transferred into a long-term neocortical store. In the decades since the publication of Marr’s computational theory, many researchers have built on these ideas and simulated memory formation and retrieval in Marr-like models of the hippocampus. For the most part, modelers have focused on either the CA3 or CA1 ﬁelds, using variants of Hebbian learning, for example, competitive learning in the dentate gyrus and CA3 (Hasselmo et al., 1996; McClelland et al., 1995; Rolls, 1989), Hebbian autoassociative learning (Kali and Dayan, 2000; Marr, 1971; McNaughton and Morris, 1987; O’Reilly and Rudy, 2001; Rolls, 1989; Treves and Rolls, 1992), temporal associative learning (Gerstner and Abbott, 1997; Levy, 1996; Stringer et al., 2002; Wallenstein and Hasselmo, 1997) in the CA3 recurrent collaterals, and Hebbian heteroassociative learning between EC-driven CA1 activity and CA3 input (Hasselmo and Schnell, 1994) or between EC-driven and CA3-driven CA1 activity at successive points in time (Levy et al., 1990). The key ideas behind these models are summarized in ﬁg 1.9.

Figure 1.8 Some of the main anatomical connections of the hippocampus. The hippocampus is a major convergence zone. It receives input via the entorhinal cortex from most regions of the brain including the ventral and dorsal visual pathways (via perirhinal and parahippocampal cortex, respectively). It also sends reciprocal projections back to most regions of the brain. Within the hippocampus, the major regions are the dentate gyrus (DG), CA3, and CA1. The CA1 region projects back to the entorhinal cortex, thus completing the loop. Note that the subiculum, not shown here, is another major output target of the hippocampus.

In modeling the MTL’s hippocampal memory system, Becker (2005) has shown that a global optimization principle based on the goal of accurate input reconstruction, combined with neuroanatomical constraints, leads to simple, biologically plausible learning rules for all regions within the hippocampal circuit. The model exhibits the key features of an episodic memory system: high storage capacity, accurate cued recall, and association of items across time, under extremely high plasticity conditions. The key assumptions in Becker’s model are as follows: During encoding, dentate granule cells are active whereas during retrieval they are relatively silent. During encoding, activation of CA3 pyramidals is dominated by the very strong mossy ﬁber inputs from dentate granule cells. During retrieval, activation of CA3 pyramidals is driven by direct perforant path inputs from the entorhinal cortex combined with time-delayed input from CA3 via recurrent collaterals. During encoding, activation of CA1 pyramidals is dominated by direct perforant path inputs from the entorhinal cortex.


Figure 1.9 Various models have been proposed for specific regions of the hippocampus, for example, (A) models based on variants of competitive learning have been proposed for the dentate gyrus; (B) many models of the CA3 region have been based upon the recurrent autoassociator, and (C) several models of CA1 have been based on the heteroassociative network, where the input from the entorhinal cortex to CA1 acts as a teaching signal, to be associated with the (nondriving) input from the CA3 region.

During retrieval, CA1 activations are driven by a combination of perforant path inputs from the entorhinal cortex and Schaffer collateral inputs from CA3. Becker proposed that each hippocampal layer should form a neural representation that could be transformed in a simple manner, i.e., linearly, to reconstruct the original activation pattern in the entorhinal cortex. With the addition of biologically plausible processing constraints regarding connectivity, sparse activations, and two modes of neuronal dynamics during encoding versus retrieval, this results in very simple Hebbian learning rules. It is important to note, however, that the model itself is highly nonlinear, due to the sparse coding in each region and the multiple stages of processing in the circuit as a whole; the notion of linearity only comes in at the point of reconstructing the EC activation pattern from any one region's activities. The objective function made use of the idea of an implicit set of reconstruction weights from each hippocampal region, by assuming that the perforant path connection weights could be used in reverse to reconstruct the EC input pattern. Taking the CA3 layer as an example, the CA3 neurons receive perforant path input from the entorhinal cortex, EC^{(in)}, associated with a matrix of weights W^{(EC,CA3)}. The CA3 region also receives input connections from the dentate gyrus, DG, with associated weights W^{(DG,CA3)}, as well as recurrent collateral input from within the CA3 region with connection weights W^{(CA3,CA3)}. Using the transpose of the perforant path weights, (W^{(EC,CA3)})^T, to calculate the CA3 region's reconstruction of the entorhinal input vector

EC^{(reconstructed)} = (W^{(EC,CA3)})^T CA3,    (1.1)

the goal of the learning is to make this reconstruction as accurate as possible. To quantify this goal, the objective function Becker proposed to maximize here is the cosine of the angle between the original and reconstructed activations:

Perf(CA3) = cos(EC^{(in)}, (W^{(EC,CA3)})^T CA3)
          = (EC^{(in)})^T (W^{(EC,CA3)})^T CA3 / ( ||EC^{(in)}|| ||(W^{(EC,CA3)})^T CA3|| ).    (1.2)

By rearranging the numerator, and appropriately constraining the activation levels and the weights so that the denominator becomes a constant, it is equivalent to maximize the following simpler expression:

Perf(CA3) = (W^{(EC,CA3)} EC^{(in)})^T CA3,    (1.3)

which makes use of the locally available information arriving at the CA3 neurons' incoming synapses: the incoming weights and activations. This says that the incoming weighted input from the perforant path should be as similar as possible to the activation in the CA3 layer. Note that the CA3 activation, in turn, is a function of both perforant path and DG input as well as CA3 recurrent input. The objective functions for the dentate and CA1 regions have exactly the same form as equation 1.3, using the DG and CA1 activations and perforant path connection weights respectively. Thus, the computational goal for the learning in each region is to maximize the overlap between the perforant path input and that region's reconstruction of the input. This objective function can be maximized with respect to the connection weights on each set of input connections for a given layer, to derive a set of learning equations. By combining the learning principle with the above constraints, Hebbian learning rules are derived for the direct (monosynaptic) pathways from the entorhinal cortex to each hippocampal region, a temporal Hebbian associative learning rule is derived for the CA3 recurrent collateral connections, and a form of heteroassociative learning is derived for the Schaffer collaterals (the projection from CA3 to CA1). Of fundamental importance for computational theories of hippocampal coding is the striking finding of neurogenesis in the adult hippocampus. Although there is now a large literature on neurogenesis in the dentate gyrus, and it has been shown to be important for at least one form of hippocampal-dependent learning, surprisingly few attempts have been made to reconcile this phenomenon with theories of hippocampal memory formation. Becker (2005) suggested that the function of new neurons in the dentate gyrus is in the generation of novel codes.
Gradual changes in the internal code of the dentate layer were predicted to facilitate the formation of distinct representations for highly similar memory episodes.
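The reconstruction objective of equations 1.1 through 1.3 is easy to check numerically. A minimal numpy sketch; the layer sizes, random weights, and rectifying activation are invented for illustration, not anatomical values:

```python
import numpy as np

def perf_ca3(ec_in, W, ca3):
    """Cosine objective of equation 1.2: cos(EC_in, W^T CA3),
    with W the perforant path weight matrix (CA3 x EC here)."""
    recon = W.T @ ca3                                  # equation 1.1
    return ec_in @ recon / (np.linalg.norm(ec_in) * np.linalg.norm(recon))

rng = np.random.default_rng(1)
n_ec, n_ca3 = 20, 50                   # invented sizes, not real cell counts
W = rng.normal(size=(n_ca3, n_ec))     # perforant path, EC -> CA3
ec = rng.normal(size=n_ec)             # entorhinal input pattern EC_in
ca3 = np.maximum(W @ ec, 0.0)          # CA3 driven by perforant path input

# The rearranged numerator of equation 1.3 equals that of equation 1.2:
# (W EC_in)^T CA3 = EC_in^T (W^T CA3).
assert np.isclose((W @ ec) @ ca3, ec @ (W.T @ ca3))
assert perf_ca3(ec, W, ca3) <= 1.0
```

The identity in the last assertion is exactly the rearrangement that lets the objective be computed from information local to the CA3 synapses.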


Figure 1.10 Architecture of a learning system that incorporates perceptual learning, episodic memory, and motor control. Posterior cortical areas perform stimulus coding, the medial temporal lobe (hippocampus) supports memorization of complex events, and the prefrontal cortex supports prediction and control of motor output, with environmental feedback closing the loop.

Why doesn’t the constant turnover of neurons in the dentate gyrus, and hence the constant rewiring of the hippocampal memory circuit, interfere with the retrieval of old memories? The answer to this question comes naturally from the above assumptions about neuronal dynamics during encoding versus retrieval. New neurons are added only to the dentate gyrus, and the dentate gyrus drives activation in the hippocampal circuit only during encoding, not during retrieval. Thus, the new neurons contribute to the formation of distinctive codes for novel events, but not to the associative retrieval of older memories.

1.4

Representations That Guide Action Selection

Moving beyond the question of how information is represented, we must consider the brain not simply as a passive storage device, but as part of a dynamical computational system that acts and reacts to changes within its environment, as illustrated in fig 1.10. Ultimately, models based on the broad goals of prediction and control may be our best hope for characterizing complex dynamical systems which form representations in the service of guiding motor actions. Reinforcement learning algorithms can be applied to control problems, and have been linked closely to specific neural mechanisms. These algorithms are built upon the concept of a value function, V(s_t), which defines the value of being in the current state s_t at time t to be equal to the expected sum of future rewards:

V(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + . . . + γ^n r_{t+n} + . . .

The parameter γ, chosen to be in the range 0 ≤ γ ≤ 1, is a temporal discount factor which permits one to heuristically weight future rewards more or less heavily according to the task demands. Within this framework, the goal for the agent is to choose actions that will maximize the value function. In order for the agent to solve the control problem (how to select optimal actions), it must first solve the prediction problem (how to estimate the value function). The temporal difference (TD) learning algorithm (Sutton, 1988; Sutton and Barto, 1981) provides a rule for incrementally updating an estimate V̂_t of the true value function at time t by an amount called the TD-error:

TD-error = r_{t+1} + γ V̂_{t+1} − V̂_t,

which makes use of r_{t+1}, the reward received at the next time step, and the value estimates at the current and the next time step. It has been proposed that the TD-learning algorithm may be used by neurobiological systems, based on evidence that firing of midbrain dopamine neurons correlates well with TD-error (Montague et al., 1996). The Q-learning algorithm (Watkins, 1989) extends the idea of TD learning to the problem of learning an optimal control policy for action selection. The goal for the agent is to maximize the total future expected reward. The agent learns incrementally by trial and error, evaluating the consequences of taking each action in each situation. Rather than using a value function, Q-learning employs an action-value function, Q(s_t, a_t), which represents the value of taking an action a_t when the state of the environment is s_t. The learning algorithm for incrementally updating estimates of Q-values is directly analogous to TD learning, except that the TD-error is replaced by a temporal difference between Q-values at successive points in time. Becker and Lim (2003) proposed a model of controlled memory retrieval based upon Q-learning. People have a remarkable ability to encode and retrieve information in a flexible manner. Understanding the neuronal mechanisms underlying strategic memory use remains a true challenge.
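The TD(0) update defined by the TD-error above can be sketched in tabular form. The two-state chain environment and the learning parameters are invented for illustration:

```python
import numpy as np

def td_update(V, s, s_next, r_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: TD-error = r_{t+1} + gamma*V[s_{t+1}] - V[s_t],
    applied with learning rate alpha to the tabular estimate V."""
    td_error = r_next + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# Toy chain: state 0 -> state 1 -> terminal state 2 with reward 1.0.
V = np.zeros(3)
for _ in range(200):
    td_update(V, 0, 1, 0.0)      # no reward on the first transition
    td_update(V, 1, 2, 1.0)      # terminal reward; V[2] stays 0

assert abs(V[1] - 1.0) < 1e-6    # V[1] -> r + gamma * V[2] = 1.0
assert abs(V[0] - 0.9) < 1e-4    # V[0] -> gamma * V[1] = 0.9
```

The Q-learning analogue replaces `V[s]` with a table `Q[s, a]` and bootstraps from the maximum Q-value in the next state.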
Neural network models of memory have typically dealt with only the most basic operations involved in storage and recall. Evidence from patients with frontal lobe damage indicates a crucial role for the prefrontal cortex in the control of memory. Becker and Lim’s model was developed to shed light on the neural mechanisms underlying strategic memory use in individuals with intact and lesioned frontal lobes. The model was trained to simulate human performance on free-recall tasks involving lists of words drawn from a small set of categories. Normally when people are asked repeatedly to study and recall the same list of words, their recall patterns demonstrate progressively more categorical clustering over trials. This strategy thus appears to be learned, and correlates with overall recall scores. On the other hand, when patients with frontal lobe damage perform such tests, while they do beneﬁt somewhat from the categorical structure of word lists, they tend to recall fewer categories in total, and tend to show lower semantic clustering scores. Becker and Lim (2003) postulated a role for the prefrontal cortex (PFC) in self-organizing novel mnemonic codes that could subsequently be used as retrieval cues to improve retrieval from long-term memory. Their model is outlined in ﬁg 1.11. The “actions” or responses in this model are actually the activations generated by model neurons in the PFC module. Thus, the activation of each response unit is proportional to the network’s current estimate of the Q-value associated with


Figure 1.11 Becker and Lim’s architecture for modeling the frontal control of memory retrieval. The model operated in two diﬀerent modes: (A) During perception of an external stimulus (during a study phase) there was bottom-up ﬂow of activation. (B) During free recall, when a response was generated internally, there was a top-down ﬂow of activation in the model. After an item was retrieved, but before a response was generated, the item was used to probe the MTL memory system, and its recency was evaluated. If the recency (based on a match of the item to the memory weight matrix) was too high, the item was considered to be a repetition error, and if too low, it was considered to be an extralist intrusion error. Errors detected by the model were not generated as responses, but were used to generate internal reinforcement signals for learning the PFC module weights. Occasionally, a repetition or intrusion error might go undetected by the model, resulting in a recall error.

that response, and response probabilities are calculated directly from these Q-values. Learning the memory retrieval strategy involved adapting the weights for the response units so as to maximize their associated Q-values. Reinforcement obtained on a given trial was self-generated by an internal evaluation module, so that the PFC module received a reward whenever a nonrepeated study list item was retrieved, and a punishment signal (negative reinforcement) when a nonlist or repeated item was retrieved. The model thereby learned to develop retrieval strategies dynamically in the course of both study and free recall of words. The model was able to capture the performance of human subjects with both intact and lesioned frontal lobes on a variety of types of word lists, in terms of both recall accuracy and patterns of errors. The model just described addresses a rather high level of complex action selection, namely, the selection of memory retrieval strategies. Most work on modeling action selection has dealt with more concrete and observable actions such as the choice of lever-presses in a response box or choice of body-turn directions in a maze. The advantage of this level of modeling is that it can make contact with a large body of experimental literature on animal behavior, pharmacology, and physiology. Many such models have employed TD-learning or Q-learning, under the assumption that animals form internal representations of value functions, which

Figure 1.12 Simulation of Cousins et al.'s T-maze cost-benefit task. The MDP representation of the task is shown in panel A, the T-maze with a barrier and larger reward in the left arm of the maze is shown in panel B, and the performance of the model (expected reward in pellets for left versus right arm choices) as a function of dopamine level is shown in panel C.

guide action selection. As mentioned above, phasic firing of dopamine neurons has been postulated to convey the TD-error signal critical for this type of learning. However, in addition to its importance in modulating learning, dopamine plays an important role in modulating action choice. It has been hypothesized that tonic levels of dopamine have more to do with motivational value, whereas the phasic firing of dopamine neurons conveys a learning-related signal (Smith et al., 2005). Rather than assuming that actions are solely guided by value functions, Smith et al. (2005) hypothesized that animals form detailed internal models of the world. Value functions condense the reward value of a series of actions into that of a single state, and are therefore insensitive to the motivational state of the animal (e.g., whether it is hungry or not). Internal models, on the other hand, allow a mental simulation of alternative action choices, which may result in qualitatively different rewards. For example, an animal might perform one set of actions leading to water only if it is thirsty, and another set of actions leading to food only if it is hungry. The internal model can be described by a Markov decision process (MDP) over a set of internal states, with associated transition function and reward function, as in fig 1.12A. The transition function and (immediate) reward value of each


state are learned through trial and error. Once the model is fully trained, action selection involves simulating a look-ahead process in the internal model for one or more steps in order to evaluate the consequences of an action. Finally, at the end of the simulation sequence, the animal's internal model reveals whether the outcome is favorable (leads to reward) or not. An illustrative example is shown in fig 1.12B. The choice faced by the animal is either to take the right arm of the T-maze to receive a small reward, or to take the left arm and then jump over a barrier to receive a larger reward. The role of tonic dopamine in this model is to modulate the efficacy of the connections in the internal model. Thus, when dopamine is depleted, the model's ability to simulate the look-ahead process to assess expected future reward will be biased toward rewards available immediately rather than more distal rewards. This implements an online version of temporal discounting. Cousins et al. (1996) found that normal rats trained in the T-maze task in fig 1.12B are willing to jump the barrier to receive a larger food reward nearly 100% of the time. Interestingly, however, when rats were administered a substance that destroys dopaminergic (DA) projections to the nucleus accumbens (DA lesion), they chose the smaller reward. In another version of the task, rats were trained on the same maze except that there was no food in the right arm, and then when given DA lesions, they nearly always chose the left arm and jumped the barrier to receive a reward. Thus, the DA lesion was not merely disrupting motor behavior; it was interacting with the motivational value of the behavioral choices. Note that the TD-error account of dopamine only provides for a role in learning, and would have nothing to say about effects of dopamine on behavior subsequent to learning. Smith et al.
(2005) argued, based on these and other data, that dopamine serves to modulate the motivational choice of the animals, with high levels of dopamine favoring the selection of action sequences with more distal but larger rewards. In simulations of the model, depletion of dopamine therefore biases the choice in favor of the right arm in this task, as shown in fig 1.12C.
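The look-ahead idea can be caricatured with a tiny internal model. The state names, reward magnitudes, and the way the dopamine level scales each simulated step below are my assumptions for illustration, not Smith et al.'s published implementation:

```python
def lookahead_value(mdp, state, da_level, depth=5):
    """Evaluate actions by simulating the internal model.
    `mdp` maps (state, action) -> (next_state, reward); tonic dopamine
    `da_level` in [0, 1] scales the efficacy of each simulated step,
    acting as an online temporal discount (a sketch of the idea only)."""
    if depth == 0:
        return 0.0
    best = 0.0
    for (s, a), (s2, r) in mdp.items():
        if s != state:
            continue
        v = r + da_level * lookahead_value(mdp, s2, da_level, depth - 1)
        best = max(best, v)
    return best

# Toy T-maze: right arm = 2 pellets now; left arm = barrier, then 4 pellets.
mdp = {
    ("start", "right"): ("right_arm", 2.0),
    ("start", "left"): ("barrier", 0.0),
    ("barrier", "jump"): ("left_arm", 4.0),
}
assert lookahead_value(mdp, "start", da_level=1.0) == 4.0   # intact: go left
assert lookahead_value(mdp, "start", da_level=0.25) == 2.0  # depleted: go right
```

With full dopamine the simulated barrier-jump sequence wins; with depleted dopamine the distal reward is attenuated and the immediate small reward dominates, mirroring the lesion data.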

1.5

New Directions: Integrating Multiple Memory Systems

In this chapter, we have reviewed several approaches to modeling the mind, from low-level sensory coding, to high-level memory systems, to action selection. Somehow, the brain accomplishes all of these functions, and it is highly unlikely that they are carried out in isolation from one another. For example, we now know that striatal dopaminergic pathways, presumed to carry a reinforcement learning signal, affect sensory coding even in early sensory areas such as primary auditory cortex (Bao et al., 2001). Future work must address the integration of these various levels of modeling.

2

Empirical Statistics and Stochastic Models for Visual Signals

David Mumford


The formulation of the vision problem as a problem in Bayesian inference (Forsyth and Ponce, 2002; Mumford, 1996, 2002) is, by now, well known and widely accepted in the computer vision community. In fact, the insight that the problem of reconstructing 3D information from a 2D image is ill posed and needs inference can be traced back to the Arab scientist Ibn Al-Haytham (known to Europe as Alhazen) around the year 1000 (Haytham, c. 1000). Inheriting a complete hodgepodge of conflicting theories from the Greeks,1 Al-Haytham for the first time demonstrated that light rays originated only in external physical sources, and moved in straight lines, reflecting and refracting, until they hit the eye; and that the resulting signal needed to be, and was, actively decoded in the brain using a largely unconscious and very rapid inference process based on past visual experiences. In the modern era, the inferences underlying visual perception have been studied by many people, notably H. Helmholtz, E. Brunswik (Brunswik, 1956), and J. J. Gibson. In mathematical terms, the Bayesian formulation is as follows: let I be the observed image, a 2D array of pixels (black-and-white or colored or possibly a stereoscopic pair of such images). Here we are assuming a static image.2 Let w stand for variables that describe the external scene generating the image. Such variables should include depth and surface orientation information (Marr's 2.5D sketch), location and boundaries of the principal objects in view, their surface albedos, location of light sources, and labeling of object categories and possibly object identities. Then two stochastic models, learned from past experience, are required: a prior model p(w) specifying what scenes are likely in the world we live in, and an imaging model p(I|w) specifying what images should look like, given the scene. Then by Bayes's rule:

p(w|I) = p(I|w) p(w) / p(I) ∝ p(I|w) p(w).

Bayesian inference consists in ﬁxing the observed value of I and inferring that w equals that value which maximizes p(w|I) or equivalently maximizes p(I|w)p(w). This is a ﬁne general framework, but to implement or even test it requires (1) a


Empirical Statistics and Stochastic Models for Visual Signals

theory of stochastic models of a very comprehensive sort, one which can express all the complex but variable patterns which the variables w and I obey; (2) a method of learning from experience the many parameters which such theories always contain; and (3) a method of computing the maximum of p(w|I). This chapter will be concerned only with problem 1. Many critiques of vision algorithms have failed to allow for the fact that these are three separate problems: if the methods used for problems 2 or 3 are badly implemented, the resulting failures do not imply that the theory itself (problem 1) is bad. For example, very slow algorithms of type 3 may reasonably be used to test ideas of type 1. Progress in understanding vision does not require all these problems to be solved at once. Therefore, it seems to me legitimate to isolate problems of type 1. In the rest of this chapter, I will review some of the progress in constructing these models. Specifically, I will consider, in section 2.1, models of the empirical probability distribution p(I) inferred from large databases of natural images. Then, in section 2.2, I will consider the first step in so-called intermediate vision: inferring the regions which should be grouped together as single objects or structures, a problem which includes segmentation and gestalt grouping, the basic grammar of image analysis. Finally, in section 2.3, I look at the problem of priors on 2D shapes and the related problem of what it means for two shapes to be "similar". Obviously, all of these are huge topics and I cannot hope to give a comprehensive view of work on any of them. Instead, I shall give my own views of some of the important issues and open problems and outline the work that I know well. As this inevitably emphasizes the work of my associates, I must beg indulgence from those whose work I have omitted.
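As a toy illustration of the Bayesian decomposition above (my example, not the chapter's: the scene labels and all probabilities below are invented for the sketch), the MAP scene is the w maximizing p(I|w)p(w); normalizing by p(I) never changes the argmax:

```python
# Toy MAP inference: w = hypothetical scene label, I = one observed pixel feature.
# All priors/likelihoods are made-up numbers, for illustration only.
prior = {"sky": 0.5, "foliage": 0.3, "road": 0.2}   # p(w)
likelihood = {"sky": 0.70, "foliage": 0.05, "road": 0.10}  # p(I | w), I = "bright blue pixel"

# p(w | I) = p(I | w) p(w) / p(I), proportional to p(I | w) p(w).
posterior_unnorm = {w: likelihood[w] * prior[w] for w in prior}
p_I = sum(posterior_unnorm.values())                # p(I) by total probability
posterior = {w: v / p_I for w, v in posterior_unnorm.items()}

w_map = max(posterior, key=posterior.get)           # the MAP estimate of w
print(w_map, round(posterior[w_map], 3))            # sky 0.909
```

Dividing by p(I) only rescales, which is why maximizing p(w|I) and maximizing p(I|w)p(w) pick the same scene.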

2.1  Statistics of the Image Alone

The most direct approach to studying images is to ask whether we can find good models for images without any hidden variables. This means first creating a large database of images I that we believe are reasonably random samples of all possible images of the world we live in. Then we can study this database with all the tools of statistics, computing the responses of various linear and nonlinear filters and looking at the individual and joint histograms of their values. "Nonlinear" should be taken in the broadest sense, including order statistics or topological analyses. We then seek to isolate the most important properties these statistics have and to create the simplest stochastic models p(I) that duplicate or approximate these statistics. The models can be further tested by sampling from them and seeing whether the resulting artificial images have the same "look and feel" as natural images, or, if not, what the simplest properties of natural images are that we have failed to capture. For another recent survey of such models, see Lee et al. (2003b).

2.1.1  High Kurtosis as the Universal Clue to Discrete Structure

The first really striking thing about filter responses is that they always have large kurtosis. It is strange that the electrical engineers designing TV sets in the 1950s do not seem to have pointed this out; the fact first appeared in the work of David Field (Field, 1987). By kurtosis, we mean the normalized fourth moment. If x is a random real number, its kurtosis is

κ(x) = E((x − x̄)^4) / E((x − x̄)^2)^2.


Every normal variable has kurtosis 3; a variable which has no tails (e.g., one uniformly distributed on an interval) or is bimodal and small at its mean tends to have kurtosis less than 3; a variable with heavy tails or a large peak at its mean tends to have kurtosis larger than 3. The empirical result observed for images is that for any linear filter F with zero mean, the values x = (F ∗ I)(i, j) of the filtered image follow a distribution with kurtosis larger than 3. The simplest case of this is the difference of adjacent pixel values, the discrete derivative of the image I. But it has been found (Huang, 2000) to hold even for random mean-zero filters supported in an 8 × 8 window. This high kurtosis is shown in fig. 2.1, from the thesis of J. Huang (Huang, 2000). These data were extracted from a large database of high-resolution, fully calibrated images of cities and country taken in Holland by van Hateren (1998). It is important, when studying tails of distributions, to plot the logarithm of the probability or frequency, as in this figure, not the raw probability. If you plot probabilities, all tails look alike. But if you plot their logarithms, then a normal distribution becomes a downward-facing parabola (since log(e^(−x^2)) = −x^2), so heavy tails appear clearly as curves which do not point down so fast. It is a well-known fact from probability theory that if X_t is a stationary Markov stochastic process, then the kurtosis of X_t − X_s being greater than 3 means that the process X_t has discrete jumps. In the case of vision, we have samples from an image I(s, t) depending on two variables rather than one, and the zero-mean filter is a generalization of the difference X_t − X_s. Other signals generated by the world, such as sound or prices, are functions of one variable, time.
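The kurtosis statistic described above is easy to compute. In this sketch (my own; the piecewise-constant test image is an invented stand-in for a natural-image database such as van Hateren's), the adjacent-pixel differences of an "objects on a background" image come out heavy-tailed, while a Gaussian control sits near 3:

```python
import numpy as np

rng = np.random.default_rng(0)

def kurtosis(x):
    """Normalized fourth moment E((x - mean)^4) / E((x - mean)^2)^2."""
    x = x - x.mean()
    return (x**4).mean() / (x**2).mean() ** 2

# Crude stand-in for a natural image: piecewise-constant "objects" on a
# background, so adjacent-pixel differences are mostly small with rare jumps.
img = np.zeros((128, 128))
for _ in range(40):                       # paste random constant rectangles
    x0, y0 = rng.integers(0, 100, size=2)
    w, h = rng.integers(8, 28, size=2)
    img[y0:y0 + h, x0:x0 + w] = rng.normal()
img += 0.05 * rng.normal(size=img.shape)  # mild sensor noise

deriv = np.diff(img, axis=1).ravel()      # horizontal adjacent-pixel difference
print(kurtosis(deriv))                    # heavy-tailed: typically well above 3

gauss = rng.normal(size=deriv.size)
print(kurtosis(gauss))                    # close to 3 for a normal variable
```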
A nice elementary statement of the link between kurtosis and jumps is given by the following result, taken from Mumford and Desolneux:

Theorem 2.1  Let x be any real random variable, normalized to have mean 0 and standard deviation 1. Then there is a constant c > 0 depending only on x such that if, for some n, x is the sum

x = y_1 + y_2 + · · · + y_n,

where the y_i are independent and identically distributed, then

Prob( max_i |y_i| ≥ √((κ(x) − 3)/2) ) ≥ c.

Figure 2.1  Histograms of filter values from the thesis of J. Huang, using the van Hateren database (axes: log(pdf) against filter value in units of the standard deviation). On the left, the filter is the difference of (a) horizontally adjacent pixels, and of adjacent (b) 2 × 2, (c) 4 × 4, and (d) 8 × 8 blocks; on the right, several random mean-zero filters with 8 × 8 pixel support have been used. The kurtosis of all these filter responses is between 7 and 15. Note that the vertical axis is the log of the frequency, not the frequency. The histograms on the left are displaced vertically for legibility and the dotted lines indicate one standard deviation.



A striking application of this is to the stock market. Let x be the log price change between the opening and closing price of some stock. If we assume price changes are Markov, as many have, then the experimental fact that price changes have kurtosis greater than 3 implies that stock prices cannot be modeled as a continuous function of time. In fact, in my own fit of some stock-market data, I found the kurtosis of log price changes to be infinite: the tails of the histogram of log price changes appeared to be polynomial, like 1/x^α with α between 4 and 5. An important question is, How big are the tails of the histograms of image filter statistics? Two models have been proposed for these distributions. The first, and the most commonly used, is the generalized Laplacian distribution:

p_laplace(x) = (1/Z) e^(−|x/a|^b),   Z = ∫ e^(−|y/a|^b) dy.

Here a is a scale parameter and b controls how large the tails are (larger tails for smaller b). Experimentally, these work well, and values of b between 0.5 and 1 are commonly found. However, no rationale for their occurrence seems to have been found. The second is the Bessel distribution (Grenander and Srivastava, 2001; Wainwright and Simoncelli, 2000):

p̂_bessel(ξ) = q(ξ),   q(ξ) = 1/(1 + aξ^2)^(b/2).

Again, a is a scale parameter, b controls the kurtosis (as before, larger kurtosis for smaller b), and the hat means Fourier transform. p_bessel(x) can be evaluated


explicitly using Bessel functions. The tails, however, are all asymptotically like those of double exponentials e^(−|x/a|), regardless of b. The key point is that these distributions arise as the distributions of products r · x of a Gaussian random variable x and an independent positive "scaling" random variable r. For some values of b, the variable r is distributed like ‖x‖ for a Gaussian x ∈ R^n, but in general its square has a gamma (or chi-squared) distribution. The great appeal of such a product is that images are also formed as products, especially as products of local illumination, albedo, and reflectance factors. This may well be the deep reason for the validity of the Bessel models. Convincing tests of which model is better have not been made. The difficulty is that they differ most in their tails, where data is necessarily very noisy. The best approach might be to use the Kolmogorov-Smirnov statistic and compare the best-fitting models of each type under this statistic. The world seems to be composed of discrete jumps in time and discrete objects in space. This profound fact about the physical nature of our world is clearly mirrored in the simple statistic kurtosis.
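The product construction r · x behind the Bessel models is easy to check numerically. This sketch (my own, with an arbitrarily chosen gamma parameter) verifies that mixing the scale of a Gaussian pushes its kurtosis above 3:

```python
import numpy as np

rng = np.random.default_rng(1)

def kurtosis(x):
    """Normalized fourth moment E((x - mean)^4) / E((x - mean)^2)^2."""
    x = x - x.mean()
    return (x**4).mean() / (x**2).mean() ** 2

n = 200_000
x = rng.normal(size=n)                        # Gaussian factor
r2 = rng.gamma(shape=1.5, scale=1.0, size=n)  # r^2 gamma-distributed, as in the text
y = np.sqrt(r2) * x                           # the scale mixture r * x

print(round(kurtosis(x), 2))   # close to 3
print(round(kurtosis(y), 2))   # about 5 here: kappa = 3 E(r^4) / E(r^2)^2
```

For this gamma parameter the exact value is 3 · 3.75 / 1.5² = 5; any nondegenerate scaling variable r gives kurtosis strictly above 3.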

2.1.2  Scaling Properties of Images and Their Implications

After high kurtosis, the next most striking statistical property of images is their approximate scale invariance. The simplest way to define scale invariance precisely is this: imagine we had a database of 64 × 64 images of the world and that this could be modeled by a probability distribution p_64(I) in the Euclidean space R^4096 of all such images. Then we can form marginal 32 × 32 images in two different ways: we either extract the central 32 × 32 set of pixels from the big image I, or we cover the whole 64 × 64 image by 1,024 2 × 2 blocks of pixels and average each such block to get a 32 × 32 image (i.e., we "blow down" I in the crudest way). The assertion that images are samples from a scale-invariant distribution is that the two resulting marginal distributions on 32 × 32 images are the same. This should happen for images of any size, and we should also assume that the distribution is stationary, i.e., translating an image gives an equally probable image. The property is illustrated in fig. 2.2. It is quite remarkable that, to my knowledge, no test of this hypothesis on reasonably large databases has contradicted it. Many histograms of filter responses on successively blown-down images have been made; order statistics have been examined; and some topological properties derived from level curves have been studied (Geman and Koloydenko, 1999; Gousseau, 2000; Huang, 2000; Huang and Mumford, 1999). All have shown approximate scale invariance.

There seem to be two simple facts about the world which combine to make this scale invariance approximately true. The first is that images of the world are taken from random distances: you may photograph your spouse's face from one inch away or from 100 meters away or anything in between. On your retina, except for perspective distortions, his or her image is scaled up or down as you move closer or farther away. The second is that objects tend to have surfaces on which smaller objects

Figure 2.2  Scale invariance defined as a fixed point under block renormalization. The top is a random 2N × 2N image which produces the two N × N images on the bottom, one by extracting a subimage, the other by 2 × 2 block averaging. These two should have the same marginal distributions. (Figure from A. Lee.)

cluster: your body has limbs which have digits which have hairs on them; your office has furniture which has books and papers which have writing on them (a limiting case of very flat objects on a surface), etc. Thus a blowup of a photograph not only shows roughly the same number of salient objects, but they occur with roughly the same contrast.3 The simplest consequence of scale invariance is the law for the decay of power at high frequencies in the Fourier transform of images (or better, the discrete cosine transform, to minimize edge effects). It says that the expected power as a function of frequency should drop off like

E_I |Î(ξ, η)|^2 ≈ C/(ξ^2 + η^2) = C/f^2,

where f = √(ξ^2 + η^2) is the spatial frequency. This power law was discovered in the 1950s. In the image domain, it is equivalent to saying that the autocorrelation of the image is approximated by a constant minus the log of the distance:

E_{I,x,y} [ (I(x, y) − Ī) · (I(x + a, y + b) − Ī) ] ≈ C − log(√(a^2 + b^2)).
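Both properties, the renormalization fixed point and the 1/f^2 law, can be checked numerically on a synthetic Gaussian field built to be scale-invariant. This is a stand-in assumption: real image statistics are non-Gaussian, but the second-order behavior sketched here is the same:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_1_over_f2(n):
    """Gaussian random field with ~1/f^2 expected power, a stand-in for a
    natural-image sample (real images are of course not Gaussian)."""
    fx = np.fft.fftfreq(n)[:, None]
    fy = np.fft.fftfreq(n)[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = np.inf                       # suppress the divergent DC (infrared) mode
    spectrum = np.fft.fft2(rng.normal(size=(n, n))) / f
    return np.real(np.fft.ifft2(spectrum))

# (1) Block renormalization: central crop vs. 2x2 block average should have the
# same marginal statistics, e.g. the std of the horizontal pixel difference.
big = sample_1_over_f2(64)
crop = big[16:48, 16:48]
blown_down = big.reshape(32, 2, 32, 2).mean(axis=(1, 3))
stds = [np.diff(im, axis=1).std() for im in (crop, blown_down)]

# (2) Power law: fit E|I^(xi, eta)|^2 ~ C / f^alpha in log-log coordinates.
img = sample_1_over_f2(256)
power = np.abs(np.fft.fft2(img)) ** 2
fx = np.fft.fftfreq(256)[:, None]
fy = np.fft.fftfreq(256)[None, :]
f = np.sqrt(fx**2 + fy**2).ravel()
mask = (f > 0.01) & (f < 0.4)              # avoid the DC and Nyquist extremes
slope, _ = np.polyfit(np.log(f[mask]), np.log(power.ravel()[mask]), 1)

print([round(s, 3) for s in stds], round(slope, 2))  # slope near -2
```

On a real image database, the same slope fit is how the empirical exponents between 1 and 3 mentioned below are measured.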

Note that the models have both infrared4 and ultraviolet divergences: the total


power diverges both as f → 0 and as f → ∞, and the autocorrelation goes to ±∞ as the distance goes to 0 and to ∞. Many experiments have been made testing this law over moderate ranges of frequencies, and I believe the conclusion to draw is this: for small databases of images, especially databases of special sorts of scenes such as forest scenes or city scenes, different powers are found to fit best. These range from 1/f^3 to 1/f, but with both a high concentration near 1/f^2 and a surprisingly large variance5 (Frenkel et al.; Huang, 2000). But for large databases, the rule seems to hold.

Another striking consequence of the approximate scale invariance is that images, if they have infinitely high resolution, are not functions at all but must be considered generalized functions (distributions in the sense of Schwartz). This means that as their resolution increases, natural images do not have definite limiting numerical values I(x, y) at almost all points x, y in the image plane. I think of this as the "mites on your eyelashes" theorem. Biologists tell us that such mites exist, and if you had Superman's X-ray vision, you not only could see them but, by the laws of reflectance, they would have high contrast, just like macroscopic objects. This mathematical implication is proven in Gidas and Mumford (2001).

This conclusion is quite controversial: others have proposed other function spaces as the natural home for random images. An early model for images (Mumford and Shah, 1989) proposed that observed images were naturally a sum I(x, y) = u(x, y) + v(x, y), where u was a piecewise smooth "cartoon", representing the important content of the image, and v was some L^2 noise. This led to the idea that the natural function space for images, after the removal of noise, was the space of functions of bounded variation, i.e., those with ∫ ‖∇I‖ dx dy < ∞. However, this approach lumped texture in with noise and results in functions u from which all texture and fine detail have been removed.
More recent models, therefore, have proposed that I(x, y) = u(x, y) + v(x, y) + w(x, y), where u is the cartoon, v is the true texture, and w is the noise. The idea was put forward by DeVore and Lucier (1994) that the true image u + v belongs to a suitable Besov space: spaces of functions f(x, y) for which bounds are put on the L^p norm of f(x + h, y + k) − f(x, y) for (h, k) small. More recently, Carasso (2004) has simplified their approach and hypothesizes that images I, after removal of "noise", should satisfy

∫∫ |I(x + h, y + k) − I(x, y)| dx dy < C (h^2 + k^2)^(α/2),  for some α, as (h, k) → 0.

However, a decade ago, Rosenfeld argued with me that most of what people discard as "noise" is nothing but objects too small to be fully resolved by the resolution of the camera, and thus blurred beyond recognition or even aliased. I think

Figure 2.3  This photo is intentionally upside-down, so you can look at it more abstractly. The left photo has a resolution of about 500 × 500 pixels and the right photo is the yellow 40 × 40 window shown on the left. Note (a) how the distinct shapes in the road made by the large wet/dry spots gradually merge into dirt texture, and (b) the way the bush on the right is pure noise. If the bush had moved relative to the pixels, the pattern would be totally different. There is no clear dividing line between distinct objects, texture, and noise. Even worse, some road patches which ought to be texture are larger than salient objects like the dog.

of this as clutter. The real world is made up of objects plus their parts and surface markings of all sizes, and any camera resolves only so many of these. There is an ideal image of infinite resolution, but any camera must use sensors with a positive point-spread function. The theorem above says that this ideal image, because it carries all this detail, cannot even be a function. For example, it has more and more high-frequency content as the sensors are refined, and its total energy diverges in the limit,6 hence it cannot be in L^2. In fig. 2.3, we illustrate that there is no clear dividing line between objects, texture, and noise: depending on the scale at which you view and digitize the ideal image, the same "thing" may appear as an object, as part of a texture, or as just a tiny bit of noise. This continuum has recently been analyzed beautifully by Wu et al. (2006, in revision).

Is there a simple stochastic model for images which incorporates both high kurtosis and scale invariance? There is a unique scale-invariant Gaussian model, namely colored noise whose expected power spectrum conforms to the 1/f^2 law. But this has kurtosis equal to 3. The simplest model with both properties seems to be the one proposed and studied by Gidas and me (Gidas and Mumford, 2001), which we call the random wavelet model. In this model, a random image is a countable sum:

I(x, y) = Σ_α ψ_α(e^(r_α) x − x_α, e^(r_α) y − y_α).


Here (r_α, x_α, y_α) is a uniform Poisson process in 3-space, and the ψ_α are samples from an auxiliary Lévy process, a distribution on the space of scale- and position-normalized elementary image constituents, which one may call mother wavelets or textons. These expansions converge almost surely in all the Hilbert-Sobolev spaces H^(−s) of negative exponent. Each component ψ_α represents an elementary constituent of the image. Typical choices for the ψ's would be Gabor patches, edgelets or curvelets, or more complex shapes such as ribbons or simple shapes with corners. We will discuss these in section 2.1.4, and we will return to the random wavelet model in section 2.2.3.
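A cartoon of this expansion can be sampled directly. This is not the authors' construction: Gaussian blobs stand in for samples of the Lévy measure, and a fixed number of draws replaces the Poisson process:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 128
yy, xx = np.mgrid[0:n, 0:n] / n            # unit-square coordinates

# I(x, y) = sum_a psi_a(e^r_a x - x_a, e^r_a y - y_a), with Gaussian-blob psi's
# standing in for samples from the Levy measure on textons.
img = np.zeros((n, n))
for _ in range(300):
    r = rng.uniform(0.0, 3.0)              # log-scale variable of the process
    x0, y0 = rng.uniform(0, 1, size=2)     # position of this constituent
    amp = rng.normal()                     # random amplitude absorbed into psi
    s = np.exp(r)                          # e^r shrinks the blob as r grows
    img += amp * np.exp(-((s * (xx - x0))**2 + (s * (yy - y0))**2) / 0.02)

print(img.shape, round(float(img.std()), 2))
```

Replacing the blobs with oriented ribbons gives pictures like the right panel of fig. 2.4.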

2.1.3  Occlusion and the "Dead Leaves" Model

There is, however, a third basic aspect of image statistics which we have so far not considered: occlusion. Images are two-dimensional projections of the three-dimensional world, and objects get in front of each other. This means that it is a mathematical simplification to imagine images as sums of elementary constituents. In reality, objects are ordered by distance from the lens, and they should be combined by the nonlinear operation in which nearer surface patches overwrite more distant ones. Statistically, this manifests itself in a strongly non-Markovian property of images: suppose an object with a certain color and texture is occluded by a nearer object. Then, on the far side of the nearer object, the more distant object may reappear; hence its color and texture have a larger probability of occurring there than in a Markov model. This process of image construction was studied by the French school of Matheron and Serra based at the École des Mines (Serra, 1983 and 1988). Their "dead leaves model" is similar to the above random wavelet expansion except that occlusion is used. We imagine that the constituents of the image are tuples (r_α, x_α, y_α, d_α, D_α, ψ_α), where r_α, x_α, and y_α are as before, but now d_α is the distance from the lens to the αth image patch and ψ_α is a function defined only on the set D_α of points (x, y). We make no a priori condition on the density of the Poisson process from which (r_α, x_α, y_α, d_α) is sampled. The image is then given by

I(x, y) = ψ_{α(x,y)}(e^(r_{α(x,y)}) x − x_{α(x,y)}, e^(r_{α(x,y)}) y − y_{α(x,y)}),

where

α(x, y) = argmin_{α : (x,y) ∈ D_α} d_α.
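The argmin occlusion rule can be sketched directly. This is my own toy version, using constant-colored disks as the leaves (echoing the left panel of fig. 2.4) rather than samples of a general texton measure:

```python
import numpy as np

rng = np.random.default_rng(5)

# Dead leaves sketch: random disks occlude one another; each pixel takes the
# value of the NEAREST disk covering it, rather than a sum (cf. the argmin).
n = 128
yy, xx = np.mgrid[0:n, 0:n]

img = np.full((n, n), np.nan)              # nan = not yet covered by any leaf
depth = np.full((n, n), np.inf)            # distance d_alpha of the owning disk
for _ in range(400):
    cx, cy = rng.uniform(0, n, size=2)     # disk center
    rad = rng.uniform(3, 25)               # disk radius
    d = rng.uniform(0, 1)                  # depth of this "leaf"
    color = rng.normal()                   # constant value of psi on the disk
    inside = (xx - cx) ** 2 + (yy - cy) ** 2 < rad**2
    closer = inside & (d < depth)          # overwrite only where this leaf is nearer
    img[closer] = color
    depth[closer] = d

covered = float(np.isfinite(img).mean())
print(round(covered, 2))                   # fraction of the plane covered
```

Because nearer disks overwrite farther ones instead of adding, a far disk's color reappears on both sides of an occluder, which is exactly the non-Markovian effect described above.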

This model has been analyzed by A. Lee, J. Huang, and myself (Lee et al., 2001) but has more serious infrared and ultraviolet catastrophes than the additive one. One problem is that nearby small objects cause the world to be enveloped in a sort of fog occluding everything in the distance; another is that, with positive probability, one big nearby object occludes everything. In any case, with some cutoffs, Lee's models are approximately scale-invariant and seem to reproduce the standard elementary image statistics, e.g., two-point co-occurrence statistics as well as joint wavelet statistics, better than any other model that I know of. Examples of both types of models are shown in fig. 2.4. I believe a deeper analysis of this category of models entails modeling directly,

Figure 2.4  Synthetic images illustrating the generic image models from the text. On the left, a sample dead leaves model using disks as primitives; on the right, a random wavelet model whose primitives are short ribbons.

not the objects in 2D projection, but their statistics in 3D. What is evident then is that objects are not scattered in 3-space following a Poisson process, but rather are agglutinative: smaller objects collect on or near the surface of bigger objects (e.g., houses and trees on the earth, limbs and clothes on people, buttons and collars on clothes, etc.). The simplest mathematical model for this would be a random branching process in which an object has "children", the smaller objects clustering on its surface. We will discuss a 2D version of this in section 2.2.3.

2.1.4  The Phonemes of Images

The final component of this direct attack on image statistics is the investigation of its elementary constituents, the ψ's above. In analogy with speech, one may call these constituents phonemes (or phones). The original proposals for such building blocks were given by Julesz and Marr. Julesz was interested in what made two textures distinguishable or indistinguishable. He proposed that one should break textures locally into textons (Julesz, 1981; Resnikoff, 1989) and, supported by his psychophysical studies, he proposed that the basic textons were elongated blobs and their endpoints ("terminators"). Marr (1982), motivated by the experiments of Hubel and Wiesel on the responses of cat visual cortex neurons, proposed that one should extract from an image its "primal sketch", consisting of edges, bars, and blobs. Linking these proposals with raw image statistics, Olshausen and Field (1996) showed that simple learning rules seeking a sparse coding of the image, when exposed to small patches from natural images, did indeed develop responses sensitive to edges, bars, and blobs. Another school of researchers has taken the elegant mathematical theory of wavelets and sought to find those wavelets which enable the best image compression. This has been pursued especially by Mallat (1999), Simoncelli (1999), and Donoho and their collaborators (Candes and Donoho, 2005). Having large natural image databases and powerful computers, we can now ask


for a direct extraction of these or other image constituents from a statistical analysis of the images themselves. Instead of taking psychophysical, neurophysiological, or mathematical results as a basis, what happens if we let images speak for themselves? Three groups have done this: Geman and Koloydenko (Geman and Koloydenko, 1999), Huang, Lee, Pedersen, and Mumford (Lee et al., 2003a), and Malik and Shi (Malik et al., 1999). Some of the results of Huang and of Malik et al. are shown in fig. 2.5. The approach of Geman and Koloydenko was based on analyzing all 3 × 3 image patches using order statistics. The same image patches were studied by Lee and myself using their real-number values. A very similar study by Pedersen and Lee (2002) replaced the nine pixel values by nine Gaussian derivative filter responses. In all three cases, a large proportion of such image patches were found to be either of low contrast, or of high contrast and cut across by a single edge. This, of course, is not a surprise, but it quantifies the significance of edges in image structure. For example, in the study by Lee, Pedersen, and myself, we took the image patches in the top 20% quantile for contrast, then subtracted their mean and divided by their standard deviation, obtaining data points on a seven-dimensional sphere. In this sphere, there is a surface representing the responses to image patches produced by imaging straight edges with various orientations and offsets. Close analysis shows that the data is highly concentrated near this surface, with asymptotically infinite density along the surface itself. Malik and Shi take small patches and analyze them with a filter bank of 36 wavelet filters. They then apply k-means clustering to find high-density points in this point cloud. Again the centers of these clusters resemble the traditional textons and primitives. In addition, they can adapt the set of textons they derive to individual images, obtaining a powerful tool for representing a single image.
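The patch-clustering pipeline can be sketched in a few lines. This is a minimal Lloyd's-algorithm version on a synthetic edge image (an invented stand-in: the studies above use filter-bank responses or pixel values of large real-image databases):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic stand-in for a natural image: noise plus one strong vertical edge.
n = 160
img = 0.05 * rng.normal(size=(n, n))
img[:, n // 2:] += 2.0

# Collect 8x8 patches and contrast-normalize them (subtract mean, divide by norm),
# mirroring the normalization onto a sphere described in the text.
ps = 8
patches = np.array([img[i:i + ps, j:j + ps].ravel()
                    for i in range(0, n - ps, 4) for j in range(0, n - ps, 4)])
patches -= patches.mean(axis=1, keepdims=True)
norms = np.linalg.norm(patches, axis=1)
mask = norms > 1e-8
patches = patches[mask] / norms[mask][:, None]

# Plain k-means (Lloyd's algorithm); the centers are the candidate "textons".
k = 4
centers = patches[rng.choice(len(patches), k, replace=False)]
for _ in range(20):
    dists = ((patches[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = np.argmin(dists, axis=1)
    for c in range(k):
        if np.any(labels == c):
            centers[c] = patches[labels == c].mean(axis=0)

print(centers.shape)                       # k cluster centers, one per texton
```

On real data, centers like these come out as the edge, bar, and blob patches shown in fig. 2.5.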
A definitive analysis deriving the correct vocabulary of basic image constituents directly from the images has not yet been made, but the outlines of the answer are now clear.

2.2  Grouping of Image Structures

In the analysis of signals of any kind, the most basic "hidden variables" are the labels for parts of the signal that should be grouped together, either because they are homogeneous parts in some sense or because their components occur together with high frequency. This grouping process leads in speech to words and in language to the elements of grammar: phrases, clauses, and sentences. On the most basic statistical level, it seeks to group parts of the signal whose probability of occurring together is significantly greater than it would be if they were independent: see section 2.2.3 for this formalism. The factors causing grouping were the central object of study for the Gestalt school of psychology. This school flourished in Germany and later in Italy in the first half of the twentieth century and included M. Wertheimer, K. Koffka, W. Metzger, E. Brunswik, G. Kanizsa, and many others. Their catalog of features which promoted grouping included

Figure 2.5  Textons derived by k-means clustering applied to 8 × 8 image patches. On the top, Huang's results for image patches from van Hateren's database; on the bottom, Malik et al.'s results using single images and filter banks. Note the occasional terminators in Huang's results, as Julesz predicted.


color and proximity, alignment, parallelism and symmetry, and closedness and convexity. Kanizsa was well aware of the analogy with linguistic grammar, titling his last book Grammatica del Vedere (Kanizsa, 1980). But they had no quantitative measures for the strength of these grouping principles, as they well knew. This is similar to the situation for traditional theories of human language grammar: a good story to explain which words are to be grouped together in phrases, but no numbers. The challenge we now face is to create theories of stochastic grammars which can express why one grouping is chosen in preference to another. It is a striking fact that, faced either with a sentence or a scene of the world, human observers choose the same groupings with great consistency. This is in contrast with computers which, given only the grouping rules, find thousands of strange parses of both sentences and images.

2.2.1  The Most Basic Grouping: Segmentation and Texture

The simplest grouping rules are those of similar color (or brightness) and proximity. These two rules have been used to attack the segmentation problem. The most naive but direct approach to image segmentation is based on the assumption that images break up into regions on which their intensity values are relatively constant and across whose boundaries those values change discontinuously. A mathematical version of this approach, which gives an explicit measure for comparing different proposed segmentations, is the energy functional proposed by Shah and myself (Mumford and Shah, 1989). It is based on a model I = u + v, where u is a simplified cartoon of the image and v is "noise":

E(I, u, Γ) = C_1 ∫∫_D (I − u)^2 dx dy + C_2 ∫∫_{D−Γ} ‖∇u‖^2 dx dy + C_3 · length(Γ),

where D = the domain of I, Γ = the boundaries of the regions which are grouped together, and C_i = parameters to be learned. In this model, pixels in D − Γ have been grouped together by stringing together pairs of nearby, similarly colored pixels. Different segmentations correspond to choosing different u and Γ, and the one with lower energy is preferred. Using the Gibbs statistical-mechanics approach, this energy can be thought of as a probability: heuristically, we set p(I, u, Γ) = e^(−E(I,u,Γ)/T)/Z, where T and Z are constants. Taking this point of view, the first term in E is equivalent to assuming v = I − u is a sample from white noise. Moreover, if Γ is fixed, then the second term in E makes u a sample from the scale-invariant Gaussian distribution on functions, suitably adapted to the smaller domain D − Γ. It is hard to interpret the third term even heuristically, although Brownian motion (x(t), y(t)) is heuristically a sample


from the prior e^(−∫(x′(t)^2 + y′(t)^2) dt), which, if we adopt arc-length parameterization, becomes e^(−length(Γ)). If we stay in the discrete pixel setting, the Gibbs model corresponding to E makes good mathematical sense; it is a variant of the Ising model of statistical mechanics (Blake and Zisserman, 1987; Geman and Geman, 1984).

The most obvious weakness in this model is its failure to group similarly textured regions together. Textural segmentation is an example of the hierarchical application of gestalt rules: first the individual textons are grouped by having similar colors, orientations, lengths, and aspect ratios. Then these groupings of textons are further grouped into extended textured regions with homogeneous or slowly varying "texture". Ad hoc adaptations of the above energy approach to textural grouping (Geman and Graffigne, 1986; Hofmann et al., 1998; Lee et al., 1992) have been based on choosing some filter bank, the similarity of whose responses is taken as a surrogate for the first low-level texton grouping. One of the problems of this approach is that textures are often characterized not so much by an average of all filter responses as by the very large response of one particular filter, especially by the outliers occurring when this filter precisely matches a texton (Zhu et al., 1997). A careful and very illuminating statistical analysis of the importance of color, textural, and edge features in grouping, based on human-segmented images, was given by Malik's group (Foulkes et al., 2003).
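A discrete, piecewise-constant sketch of this energy is easy to evaluate. This is my simplification: u is taken as the per-region mean, so the gradient (C_2) term vanishes and only the fidelity and boundary-length terms remain; the weights are arbitrary stand-ins for the "parameters to be learned":

```python
import numpy as np

def ms_energy(I, labels, C1=1.0, C3=0.5):
    """Discrete piecewise-constant Mumford-Shah energy: u = per-region mean,
    Gamma = edges between differently labeled 4-neighbors (C2 term vanishes)."""
    u = np.zeros_like(I)
    for c in np.unique(labels):
        u[labels == c] = I[labels == c].mean()   # best constant on each region
    fidelity = C1 * ((I - u) ** 2).sum()
    # boundary length = number of label changes between neighboring pixels
    length = ((labels[:, 1:] != labels[:, :-1]).sum()
              + (labels[1:] != labels[:-1]).sum())
    return fidelity + C3 * length

# Two-region test image: a vertical step edge plus noise.
rng = np.random.default_rng(7)
I = np.concatenate([np.zeros((16, 8)), np.ones((16, 8))], axis=1)
I += 0.05 * rng.normal(size=I.shape)

good = (np.arange(16)[None, :] >= 8).astype(int) * np.ones((16, 1), dtype=int)
bad = np.zeros_like(good)                        # everything lumped into one region

print(ms_energy(I, good) < ms_energy(I, bad))    # True: the true segmentation wins
```

Comparing energies of candidate labelings in exactly this way is how "the one with lower energy is preferred" plays out in the discrete (Ising-like) setting.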

Extended Lines and Occlusion

The most striking demonstrations of gestalt laws of grouping come from occlusion phenomena, when edges disappear behind an object and reappear. A typical example is shown in ﬁg 2.6. The most famous example is the so-called Kanizsa triangle, where, to further complicate matters, the foreground triangle has the same color as the background with only black circles of intermediate depth being visible. The grouping laws lead one to infer the presence of the occluding triangle and the completion of the three partially occluded black circles. An amusing variant, the Kanizsa pear, is shown in the same ﬁgure. These eﬀects are not merely psychophysical curiosities. Virtually every image of the natural world has major edges which are occluded one or more times by foreground objects. Correctly grouping these edges goes a long way to ﬁnding the correct parse of an image. A good deal of modeling has gone into the grouping of disconnected edges into extended edges and the evaluation of competing groupings by energy values or probabilities. Pioneering work was done by Parent and Zucker (1989) and Shashua and Ullman (1988). Nitzberg, Shiota, and I proposed a model for this (Nitzberg et al., 1992) which was a small extension of the Mumford-Shah model. The new energy involves explicitly the overlapping regions Rα in the image given by the 3D objects in the scene, both the visible and the occluded parts of these objects. Therefore, ﬁnding its minimum involves inferring the occluded parts of the visible objects as well as the boundaries of their visible parts. (These are literally “hidden


Figure 2.6 Two examples of gestalt grouping laws: on the left, the black bars are continued under the white blob to form the letter T; on the right, the semicircles are continued underneath a foreground "pear" which must be completed by contours with zero contrast.

variables”.) Moreover, we need the depth order of the objects—which are nearer, which farther away. The cartoon u of the image is now assumed piecewise constant, with value uα on the region Rα . Then,

E(I, {uα}, {Rα}) = Σα [ ∫_{R′α} C1 (I − uα)² dx dy + ∫_{∂Rα} (C2 κ∂Rα² + C3) ds ],

where

R′α = Rα − ∪_{nearer Rβ} (Rα ∩ Rβ) = visible part of Rα, and

κ∂Rα = curvature of ∂Rα. This energy allows one to quantify the application of gestalt rules for inferring occluded objects and predicts correctly, for example, the objects present in the Kanizsa triangle. The minima of this E will infer specific types of hidden contours, namely contours which come from the purely geometric variational problem of minimizing a sum of squared curvature and arc length along an unknown curve. This variational problem was first formulated by Euler, who called the resulting curves elastica. To make a stochastic model out of this, we need a stochastic model for the edges occurring in natural images. There are two parts to this: one is modeling the local nature of edges in images and the other is modeling the way they group into extended curves. Several very simple ideas for modeling curves locally, based on Brownian motion, were proposed in Mumford (1992). Brownian paths themselves are too jagged to be suitable, but one can assume the curves are C¹ and that their orientation θ(s), as a function of arc length, is Brownian. Geometrically, this is like saying their curvature is white noise. Another alternative is to take 2D projections of 3D curves whose direction of motion, given by a map from arc length to points


on the unit sphere, is Brownian. Such curves have more corners and cusps, where the 3D path heads toward or away from the camera. Yet another option is to generate parameterized curves whose velocity (x′(t), y′(t)) is given by two Ornstein-Uhlenbeck processes (Brownian functions with a restoring force pulling them to 0). These paths have nearly straight segments when the velocity happens to get large. A key probability distribution in any such theory is p(x, y, θ), the probability density that if an image contour passes through (0, 0) with horizontal tangent, then this contour will also pass through (x, y) with orientation θ. This function has been estimated from image databases in Geisler et al. (2001), but I do not know of any comparison of their results with mathematical models. Subsequently, Zhu (1999) and Ren and Malik (2002) directly analyzed edges and their curvature in hand-segmented images. Zhu found a high-kurtosis empirical distribution much like filter responses: a peak at 0 showing the prevalence of straight edges and large tails indicating the prevalence of corners. He built a stochastic model for polygonal approximations to these curves using an exponential model of the form

p(Γ) ∝ e^{−∫_Γ [ψ1(κ(s)) + ψ2(κ′(s))] ds},

where κ is the curvature of Γ and the ψi are unknown functions chosen so that the model yields the same distribution of κ and κ′ as that found in the data. Finding continuum limits of his models under weak convergence is an unsolved problem. Ren and Malik's models go beyond the previous strictly local ones. They are kth-order Markov models in which the orientation θk+1 of a curve at a sample point Pk+1 is a sample from a joint probability distribution of the orientations θkα of both the curve and smoothed versions of itself at other scales α, all at the previous point Pk. A completely different issue is finding probabilities that two edges should be joined, e.g., if Γ1, Γ2 are two curves ending at points P1, P2, how likely is it that in the real world there is a curve Γh joining P1 and P2 and creating a single curve Γ1 ∪ Γh ∪ Γ2? This link might be hidden in the image because of occlusion, noise, or low contrast (anyone with experience with real images will not be surprised at how often this happens). Jacobs, Williams, Geiger, and others have developed algorithms of this sort based on elastica and related ideas (Geiger et al., 1998; Williams and Jacobs, 1997). Elder and Goldberg (2002) and Geisler et al. (2001) have carried out psychophysical experiments to determine the effects of proximity, orientation difference, and edge contrast on human judgments of edge completions. One of the subtle points here (as Ren and Malik make explicit) is that this probability does not depend only on the endpoints Pi and the tangent lines to the Γi at these points. So, for instance, if Γ1 is straight for a certain distance before its endpoint P1, then the longer this straight segment is, the more likely it is that any continuation it has will also be straight. An elegant analysis of the situation purely for straight edges has been given by Desolneux et al. (2003). It is based on what they call maximally meaningful alignments, which come from computing


Figure 2.7 An experiment finding the prostate in an MRI scan (from August (2002)). On the left, the raw scan; in the middle, edge filter responses; on the right, the computed posterior of August's curve indicator random field (which actually lives in (x, y, θ) space, hence the boundary of the prostate is actually separated from the background noise).

the probabilities of accidental alignments and no other prior assumptions. The most compelling analysis of the problem, to my mind, is that in the thesis of Jonas August (August, 2001). He starts with a prior on a countable set of true curves, assumed to be part of the image. Then he assumes a noisy version of this is observed and seeks the maximally probable reconstruction of the whole set of true curves. An example of his algorithms is shown in fig 2.7. Another algorithm for global completion of all image contours has been given recently by Malik's group (Ren et al., 2005).

2.2.3 Mathematical Formalisms for Visual Grammars

The "higher-level" gestalt rules for grouping based on parallelism, symmetry, closedness, and convexity are even harder to make precise. In this section, I want to describe a general approach to these questions. So far, we have described grammars loosely as recursive groupings of parts of a signal, where the signal can be a string of phonemes or an image of pixels. The mathematical structure which these groupings define is a tree: each subset of the domain of the image which is grouped together defines a node in this tree and, whenever one such group contains another, we join the nodes by an edge. In the case of sentences in human languages, this tree is called the parse tree. In the case of images, it is similar to the image pyramid made up of the pixels of the image plus successively "blown-down" images 2ⁿ times smaller. However, unlike the image pyramid, its nodes only stand for natural groupings, so its structure is adaptively determined by the image itself. To go deeper into the formalism of grammar, the next step is to label these groupings. In language, typical labels are "noun phrase," "prepositional clause," etc. In images, labels might be "edgelet," "extended edge," "ribbon," "T-junction," or


Figure 2.8 The parse tree for the letter A, which labels the top node; the lower nodes might be labeled "edge" and "corner." Note that in grouping the two sides, the edge has an attribute giving its length, and approximate equality of the lengths of the sides must hold; and in the final grouping, the bar of the A must meet the two sides in approximately equal angles. These are probabilistic constraints involving specific attributes of the constituents, which must be included in B.

even "the letter A." Then the grouping laws are usually formulated as productions:

noun phrase → determiner + noun
extended edge → edgelet + extended edge

where the group is on the left and its constituents are shown on the right. The second rule creates a long edge by adding a small piece, an edgelet, to one end. But now the issue of agreement surfaces: one can say “a book” and “some books” but not “a books” or “some book.” The determiner and the noun must agree in number. Likewise, to group an edge with a new edgelet requires that the edgelet connect properly to the edge: where one ends, the other must begin. So we need to endow our labeled groupings with a list of attributes that must agree for the grouping to be possible. So long as we can do this, we have created a context-free grammar. Context-freeness means that the possibility of the larger grouping depends only on the labels and attributes of the constituents and nothing else. An example of the parse of the letter A is shown in ﬁg 2.8. We make the above into a probability model in a top-down generative fashion by assigning probabilities to each production. For any given label and attributes, the sum (or integral) of the probabilities of all possible productions it can yield should be 1. This is called a PCFG (probabilistic context-free grammar) by linguists. It is the same as what probabilists call a random branching tree (except that grammars are usually assumed to almost surely yield ﬁnite parse trees). A more general formalism for deﬁning random trees with random data attached to their nodes has been given by Artur Fridman (Fridman, 2003). He calls his models mixed Markov models because some of the nodes carry address variables whose value is the index of another node. Thus in each sample from the model, this node adds a new edge to the graph. His models include PCFGs as a special case.
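As a toy illustration of this top-down generative process (my own sketch; the one-nonterminal grammar below is hypothetical, in the spirit of the edgelet production above), sampling from a PCFG just means repeatedly choosing a production, according to its probability, for each nonterminal:

```python
import random

# A toy PCFG (hypothetical grammar): each nonterminal maps to a list of
# (probability, right-hand side) pairs; the probabilities sum to 1.
# Symbols not in the grammar are terminals.
GRAMMAR = {
    "extended_edge": [
        (0.4, ["edgelet"]),                   # stop growing the edge
        (0.6, ["edgelet", "extended_edge"]),  # add one more edgelet
    ],
}

def sample(symbol, rng):
    """Top-down generative sampling: expand nonterminals recursively."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [symbol]
    r, acc = rng.random(), 0.0
    for p, rhs in GRAMMAR[symbol]:
        acc += p
        if r <= acc:
            out = []
            for s in rhs:
                out.extend(sample(s, rng))
            return out
    return []

curve = sample("extended_edge", random.Random(0))
print(curve)  # a run of "edgelet" terminals
```

Because the continuation probability 0.6 is strictly less than 1, the random branching tree is almost surely finite, and the number of edgelets in a sampled "curve" is geometrically distributed.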


Figure 2.9 A simplification of the parse tree inferred by the segmentation algorithm of Galun et al. (2003). The image is at the bottom and part of its tree is shown above it. On the right are shown some of the regions in the image, grouped by successive levels of the algorithm.

Random trees can be ﬁt naturally into the random wavelet model (or the dead leaves model) described above. To see this, we consider each 4-tuple {xα , yα , rα , ψα } in the model not merely as generating one elementary constituent of the image, but as the root of a whole random branching tree. The child nodes it generates should add parts to a now compound object, expanding the original simple image constituent ψα . For example the root might be an elongated blob representing the trunk of a person and the tree it generates would add the limbs, clothes, face, hands, etc., to the person. Or the root might be a uniform patch and the tree would add a whole set of textons to it, making it into a textured patch. So long as the rate of growth of the random branching tree is not too high, we still get a scale-invariant model. Two groups have implemented image analysis programs based on computing such trees. One is the multiscale segmentation algorithm of Galun, Sharon, Basri, and Brandt (Galun et al., 2003), which produces very impressive segmentation results. The method follows Brandt’s adaptive tree-growing algorithm called algebraic multi-grid. In their code, texture and its component textons play the same role as objects and their component parts: each component is identiﬁed at its natural scale and grouped further at a higher level in a similar way (see ﬁg 2.9). Their code is fully scale-invariant except at the lowest pixel level. It would be very interesting to ﬁt their scheme into the Bayesian framework. The other algorithm is an integrated bottom-up and top-down image parsing


program from Zhu's lab (Tu et al., 2003). The output of their code is a tree with semantically labeled objects at the top, followed by parts and texture patches in the middle, with the pixels at the bottom. This program is based on a full stochastic model. A basic problem with this formalism is that it is not sufficiently expressive: the grammars of nature appear to be context sensitive. This is often illustrated by contrasting languages that have sentences of the form abcddcba, which can be generated recursively by a small set of productions as in s → asa → absba → abcscba → abcddcba, versus languages which have sentences of the form abcdabcd, with two complex repeating structures, which cannot be generated by simple productions. Obviously, images with two identical faces are analogs of this last sentence. Establishing symmetry requires you to reopen the grouped package and examine everything in it to see if it is repeated! Unless you imagine each label given a huge number of attributes, this cannot be done in a context-free setting. In general, two-dimensional geometry creates complex interactions between groupings, and the strength of higher-order groupings seems to always depend on multiple aspects of each piece. Take the example of a square. Ingredients of the square are (1) the two groupings of parallel edges, each made up of a pair of parallel sides of equal length, and (2) the grouping of edgelets adjacent to each vertex into a "right-angle" group. The point is that the pixels involved in these smaller groupings partially intersect. In PCFGs, each group should expand to disjoint sets of primitives or to one set contained in another. The case of the square is best described with the idea of graph unification, in which a grouping rule unifies parts of the graph of parts under each constituent. S. Geman and his collaborators (Bienenstock et al., 1998; Geman et al., 2002) have proposed a general framework for developing such probabilistic context-sensitive grammars.
He proposes that for each grouping rule, in which groups y1, y2, · · · , yk are to be unified into a larger group x, there is a binding function B(y1, y2, · · · , yk) which singles out those attributes of the constituents that affect the probability of making the k-tuple of y's into an x. For example, to put two edgelets together, we need to ask if the endpoint of the first is near the beginning of the second and whether their directions are close. The closer these points and directions are, the more likely it is that the two edgelets should be grouped. The basic hypothesis is that the likelihood ratio p(x, y1, · · · , yk)/∏i p(yi) depends only on B(y1, · · · , yk). In their theory, Geman and colleagues analyze how to compute this function from data. This general framework needs to be investigated in many examples to further constrain it. An interesting example is the recent work of Ullman and collaborators (Ullman et al., 2002) on face recognition, built up through the recognition of parts: this would seem to fit into this framework. But, overall, the absence of mathematical theories which incorporate all the gestalt rules at once seems to me the biggest gap in our understanding of images.
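A minimal sketch of a binding function for the edgelet example (entirely my own: the feature choices and the exponential form of the likelihood ratio are illustrative assumptions, not the actual model of Geman et al.):

```python
import math

def binding_features(e1, e2):
    """B(y1, y2): the attributes that affect the grouping probability.
    An edgelet is a pair (start_point, end_point) of (x, y) points."""
    (x1, y1), (x2, y2) = e1
    (x3, y3), (x4, y4) = e2
    gap = math.hypot(x3 - x2, y3 - y2)        # end of e1 to start of e2
    th1 = math.atan2(y2 - y1, x2 - x1)        # direction of e1
    th2 = math.atan2(y4 - y3, x4 - x3)        # direction of e2
    # turning angle, wrapped into [0, pi]
    turn = abs(math.atan2(math.sin(th2 - th1), math.cos(th2 - th1)))
    return gap, turn

def grouping_likelihood_ratio(e1, e2, gap_scale=1.0, turn_scale=0.5):
    """Assumed model: the ratio p(x, y1, y2) / (p(y1) p(y2)) decays
    exponentially in the gap and turning angle; the scales are made up."""
    gap, turn = binding_features(e1, e2)
    return math.exp(-gap / gap_scale - turn / turn_scale)

# Two collinear, touching edgelets versus a distant, perpendicular pair.
good = grouping_likelihood_ratio(((0, 0), (1, 0)), ((1, 0), (2, 0)))
bad = grouping_likelihood_ratio(((0, 0), (1, 0)), ((3, 2), (3, 3)))
print(good, bad)  # good is 1.0; bad is close to 0
```

The point of the construction is that only the output of `binding_features` enters the decision, which is exactly the context-free-with-attributes compromise discussed above.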

2.3 Probability Measures on the Space of Shapes

The most characteristic new patterns found in visual signals, but not in one-dimensional signals, are shapes: two-dimensional regions in the domain of the image. In auditory signals, one has intervals on which the sound has a particular spectrum, for instance, corresponding to some specific type of source (for phonemes, some specific configuration of the mouth, lips, and tongue). But an interval is nothing but a beginning point and an endpoint. In contrast, a subset of a two-dimensional region is much more interesting and conveys information by itself. Thus people often recognize objects by their shape alone and have a rich vocabulary of different categories of shapes, often based on prototypes (heart-shaped, egg-shaped, star-shaped, etc.). In creating stochastic models for images, we must face the issue of constructing probability measures on the space of all possible shapes. An even more basic problem is to construct metrics on the space of shapes, measures of the dissimilarity of two shapes. It is striking how people find it quite natural to be asked if some new object has a shape similar to some old object or category of objects. They act as though they carried a clear-cut psychophysical metric in their heads, although, when tested, their similarity judgments show a huge amount of context sensitivity.

2.3.1 The Space of Shapes and Some Basic Metrics on It

What do we mean by the space of shapes? The idea is simply to define this space as the set of two-dimensional shapes, where a shape is taken to mean an open subset S ⊂ R² with smooth boundary.⁷ We let S denote this set of shapes. The mathematician's approach is to ask: what structure can we give to S to endow it with a geometry? In particular, we want to define (1) local coordinates on S, so that it is a manifold, (2) a metric on S, and (3) probability measures on S. Having probability measures will allow us to put shapes into our theory as hidden variables and extend the Bayesian inference machinery to include inferring shape variables from images. S itself is not a vector space: one cannot add and subtract two shapes in a way satisfying the usual laws of vectors. Put another way, there is no obvious way to put global coordinates on S, that is, to create a bijection between points of S and points in some vector space. One can, e.g., describe shapes by their Fourier coefficients, but the Fourier coefficients coming from shapes will be very special sequences of numbers. What we can do, however, is put a local linear structure on the space of shapes. This is illustrated in fig 2.10. Starting from one shape S, we erect normal lines at each point of the boundary Γ of S. Then nearby shapes will have boundaries which intersect each normal line in a unique point. Suppose ψ(s) ∈ R² is an arc-length parameterization of Γ. Then the unit normal vector is given by n(s) = ψ′(s)⊥ and each nearby curve is parameterized uniquely in the form

ψa(s) = ψ(s) + a(s) · n(s),

for some function a(s).
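In a discrete, polygonal setting this chart is easy to realize (a sketch under my own discretization; the function names are mine). For the unit circle the outward unit normal at ψ(s) is ψ(s) itself, so the bookkeeping is trivial:

```python
import math

def circle(n=64):
    """Arc-length samples of the unit circle, psi(s)."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]

def displace(psi, normal, a):
    """The chart: psi_a(s) = psi(s) + a(s) * n(s), sampled pointwise."""
    return [(x + ai * nx, y + ai * ny)
            for (x, y), (nx, ny), ai in zip(psi, normal, a)]

n = 64
psi = circle(n)
nrm = circle(n)  # outward normal of the unit circle equals the point itself
# A small smooth displacement a(s) = 0.2*cos(2s) yields a nearby oval shape.
a = [0.2 * math.cos(2 * (2 * math.pi * k / n)) for k in range(n)]
shape = displace(psi, nrm, a)
radii = [math.hypot(x, y) for x, y in shape]
print(min(radii), max(radii))  # roughly 0.8 and 1.2
```

The function a(s) is exactly the local coordinate: every nearby shape corresponds to one such displacement field, and adding displacement fields makes sense even though adding shapes does not.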


Figure 2.10 The manifold structure on the space of shapes is here illustrated: all curves near the heavy one meet the normal “hairs” in a unique point, hence are described by a function, namely, how far this point has been displaced normally.


All smooth functions a(s) which are sufficiently small can be used, so we have created a bijection between an open set of functions a, that is, an open set in a vector space, and a neighborhood of Γ in S. These bijections are called charts, and on overlaps of such charts one can convert the a's used to describe the curves in one chart into the functions in the other chart: this means we have a manifold. For details, see the paper (Michor and Mumford, 2006). Of course, the function a(s) lies in an infinite-dimensional vector space, so S is an infinite-dimensional manifold. But that is no deterrent to its having its own intrinsic geometry. Being a manifold means S has a tangent space at each point S ∈ S. This tangent space consists of the infinitesimal deformations of S, which may be thought of simply as normal vector fields to Γ, that is, the vector fields a(s) · n(s). We denote this tangent space as TS,S. How about metrics? In analysis, there are many metrics on spaces of functions, and they vary in two different ways. One choice is whether you make a worst-case analysis or an average analysis of the difference of two functions, or something in between. This means you define the difference of two functions a and b as sup_x |a(x) − b(x)|, as the integral ∫|a(x) − b(x)| dx, or as an L^p norm, (∫|a(x) − b(x)|^p dx)^{1/p} (which is in between). The case p = ∞ corresponds to the sup, and p = 1 to the average. Usually, the three important cases⁸ are p = 1, 2, or ∞. The other choice is whether to include derivatives of a, b as well as the values of a, b in the formula for the distance and, if so, up to what order k. These distinctions carry


Figure 2.11 Each of the shapes A, B, C, D, and E is similar to the central shape, but in different ways. Different metrics on the space of shapes bring out these distinctions (adapted from B. Kimia).

over to shapes. The best-known examples are the so-called Hausdorff metric,

d∞,0(S, T) = max( sup_{x∈S} inf_{y∈T} ‖x − y‖, sup_{y∈T} inf_{x∈S} ‖x − y‖ ),

for which p = ∞, k = 0, and the area metric,

d1,0(S, T) = Area((S − S ∩ T) ∪ (T − S ∩ T)),

the area of the symmetric difference, for which p = 1, k = 0. It is important to realize that there is no one right metric on S. Depending on the application, different metrics are good. This is illustrated in fig 2.11. The central bow-tie-like shape is similar to all the shapes around it. But different metrics bring out their dissimilarities and similarities in each case. The Hausdorff metric applied to the outsides of the shapes makes A far from the central shape; any metric using the first derivative (i.e., the orientation of the tangent lines to the boundary) makes B far from the central shape; a sup-type metric with the second derivative (i.e., the curvature of the boundary) makes C far from the central shape, as curvature becomes infinite at corners; D is far from the central shape in the area metric; E is far in all metrics, but the challenge is to find a metric in which it is close to the central shape. E has "outliers," the spikes, but is identical to the central shape if they can be ignored. To do this needs what are called robust metrics, of which the simplest example is L^{1/2} (not a true metric at all).

2.3.2 Riemannian Metrics and Probability Measures via Diffusion

There are great mathematical advantages to using L², so-called Riemannian metrics. More precisely, a Riemannian metric is given by defining a quadratic inner product in the tangent space TS,S. In Riemannian settings, the unit balls are nice


and round, and extremal problems, such as paths of shortest length, are usually well posed. This means we can expect to have geodesics, optimal deformations of one shape S to a second shape T through a family St of intermediate shapes, i.e., we can morph S to T in a most efficient way. Having geodesics, we can study the geometry of S, for instance whether its geodesics diverge or converge⁹, which depends on the curvature of S in the metric. But most important of all, we can define diffusion and use this to get Brownian paths and thus probability measures on S. A most surprising situation arises here: there are three completely different ways to define Riemannian metrics on S. We need to assign a norm to normal vector fields a(s)n(s) along a simple closed plane curve Γ. In a local (infinitesimal) metric, the norm is defined as an integral along Γ. In general, this can be any expression

‖a‖² = ∫_Γ F(a(s), a′(s), a″(s), · · · , κ(s), κ′(s), · · · ) ds,

involving a function F quadratic in a and the derivatives of a, whose coefficients can possibly be functions associated to Γ, like the curvature and its derivatives. We call these local metrics. We might have F = a(s)², or F = (1 + Aκ(s)²) · a(s)², where A is a constant; or F = a(s)² + A·a′(s)², etc. These metrics have been studied by Michor and Mumford (Michor and Mumford, 2006, 2005). Globally, the distance between two shapes is then

d(S0, S1) = inf_{paths {St}} ∫₀¹ ‖∂St/∂t‖ dt,

where ∂St/∂t is the normal vector field given by this path.
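As a concrete sketch (my own discretization, using the third choice of F above, F = a(s)² + A·a′(s)²), the squared norm of a sampled displacement field on a closed curve can be computed as:

```python
import math

def local_metric_norm_sq(a, ds, A=1.0):
    """||a||^2 = integral over the curve of (a(s)^2 + A * a'(s)^2) ds,
    for a periodic displacement field a sampled with uniform spacing ds."""
    n = len(a)
    total = 0.0
    for k in range(n):
        da = (a[(k + 1) % n] - a[k]) / ds  # forward-difference derivative
        total += (a[k] ** 2 + A * da ** 2) * ds
    return total

# On the unit circle (length 2*pi) with a(s) = cos(s):
# integral of cos^2 + A*sin^2 is pi + A*pi, so with A = 1 we expect 2*pi.
n = 1000
ds = 2 * math.pi / n
a = [math.cos(k * ds) for k in range(n)]
print(local_metric_norm_sq(a, ds))  # approximately 6.2832 = 2*pi
```

Raising A penalizes the derivative term more heavily, so a wiggly displacement field costs more than a smooth one of the same size; this is the discrete face of the choice of F.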

In other situations, a morph of one shape to another needs to be considered as part of a morph of the whole plane. For this, the metric should be a quotient of a metric on the group G of diffeomorphisms of R², with some boundary condition, e.g., equal to the identity outside some large region. But an infinitesimal diffeomorphism is just a vector field v on R², and the induced infinitesimal deformation of Γ is given by a(s) = (v · n)(s). Let V be the vector space of all vector fields on R², zero outside some large region. Then this means that the norm on a is

‖a‖² = inf_{v ∈ V, (v·n) = a} ∫_{R²} F(v, v_x, v_y, · · · ) dx dy,

where we define an inner product on V using a symmetric positive definite quadratic expression in v and its partial derivatives. We might have F = ‖v‖², or F = ‖v‖² + A‖v_x‖² + A‖v_y‖², etc. It is convenient to use integration by parts and write all such F's as (Lv, v), where L is a positive definite partial differential operator (L = I − AΔ in the second case above). These metrics have been studied by Miller, Younes, and their many collaborators (Miller, 2002; Miller and Younes, 2001) and applied extensively to the subject they call computational anatomy, that is, the analysis of medical scans by deforming them to template anatomies. Globally, the


Figure 2.12 A diffusion on the space of shapes in the Riemannian metric of Miller et al. The shapes should be imagined on top of each other, the translation to the right being added in order that each shape can be seen clearly. The diffusion starts at the unit circle.

distance between two shapes is then

dMiller(S, T) = inf_φ ∫₀¹ ( ∫_{R²} F((∂φ/∂t) ∘ φ⁻¹) dx dy )^{1/2} dt,

where φ(t), 0 ≤ t ≤ 1, is a path in G, φ(0) = I, φ(1)(S) = T.


Finally, there is a remarkable and very special metric on S̄ = S modulo translations and scalings (i.e., one identifies any two shapes which differ by a translation plus a scaling). It is derived from complex analysis and known as the Weil-Petersson (or WP) metric. Its importance is that it makes S̄ into a homogeneous metric space, that is, one which has everywhere the same geometry. There is a group of global maps of S̄ to itself which preserve distances in this metric and which can take any shape S to any other shape T. This is not the case with the previous metrics, hence the WP metric emerges as the analog of the standard Euclidean distance in finite dimensions. The definition is more elaborate and we do not give it here; see the paper (Mumford and Sharon, 2004). This metric also has negative or zero curvature in all directions, and hence finite sets of shapes, as well as probability measures on S̄, should always have a well-defined mean (minimizing the sum of squares of distances) in this metric. Finally, this metric is closely related to the medial axis, which has been frequently used for shape classification. The next step in each of these theories is to investigate the heat kernel, the solution of the heat equation starting at a delta function. This important question has not been studied yet. But diffusions in these metrics are easy to simulate. In fig 2.12 we show three random walks in S in one of Miller's metrics. The analogs of Gaussian distributions are the probability measures gotten by stopping diffusion at a specific point in time. And analogs of the scale mixtures of Gaussians discussed above are obtained by using a so-called random stopping time, that is, choosing the time to halt the diffusion randomly from another probability distribution. It seems clear that one or more of these diffusion measures are natural general-purpose priors on the space of shapes.
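Such a diffusion is indeed easy to simulate crudely. The sketch below is my own discretization, not the metric actually used for fig 2.12: at each step, white noise on the boundary is smoothed before being applied (a cheap surrogate for a Sobolev-type local metric, which suppresses high-frequency displacements), and the polygon is then displaced along its outward normals.

```python
import math, random

def random_walk_on_shapes(n=128, steps=50, step_size=0.03, smooth_passes=8, seed=1):
    """Random walk started at the unit circle.  Each step: draw white noise
    a(s), smooth it circularly, and displace the polygon along its normals."""
    rng = random.Random(seed)
    pts = [[math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)]
           for k in range(n)]
    for _ in range(steps):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(smooth_passes):  # repeated [1, 2, 1]/4 smoothing
            a = [(a[k - 1] + 2 * a[k] + a[(k + 1) % n]) / 4 for k in range(n)]
        for k in range(n):
            # outward normal of the closed polygon: rotate the (central-
            # difference) tangent by -90 degrees
            tx = pts[(k + 1) % n][0] - pts[k - 1][0]
            ty = pts[(k + 1) % n][1] - pts[k - 1][1]
            norm = math.hypot(tx, ty) or 1.0
            nx, ny = ty / norm, -tx / norm
            pts[k][0] += step_size * a[k] * nx
            pts[k][1] += step_size * a[k] * ny
    return pts

shape = random_walk_on_shapes()
radii = [math.hypot(x, y) for x, y in shape]
print(min(radii), max(radii))  # a wobbly, no-longer-circular closed curve
```

Stopping the walk at a fixed number of steps mimics a Gaussian-like measure on shapes; drawing the number of steps from another distribution gives the analog of a scale mixture, as described above.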


2.3.3 Finite Approximations and Some Elementary Probability Measures

A completely different approach is to infer probability measures directly from data. Instead of seeking general-purpose priors for stochastic models, one seeks special-purpose models for specific object-recognition tasks. This has been done by extracting from the data a finite set of landmark points, homologous points which can be found on each sample shape. For example, in 3 dimensions, skulls have long been compared by taking measurements of distances between classical landmark points. In 2 dimensions, assuming these points are on the boundary of the shape, the infinite-dimensional space S is replaced by the finite-dimensional space of the polygons {P1, · · · , Pk} ∈ R²ᵏ formed by these landmarks. But, if we start from images, we can allow the landmark points to lie in the interior of the shape also. This approach was introduced a long time ago to study faces. More specifically, it was used by Cootes et al. (1993) and by Hallinan et al. (1999) to fit multidimensional Gaussians to the cloud of points in R²ᵏ formed from landmark points on each of a large set of faces. Both groups then apply principal component analysis (PCA) and find the main directions of face variation. However, it seems unlikely to me that Gaussians can give a very good fit. I suspect rather that in geometric situations as well, one will encounter the high-kurtosis phenomenon, with geometric features often near zero but, more often than for Gaussian variables, very large too. A first attempt to quantify this point of view was made by Zhu (1999). He took a database of silhouettes of four-legged animals and computed landmark points, medial axis, and curvature for each silhouette. Then he fit a general exponential model to a set of six scalar variables describing this geometry. The strongest test of whether he has captured some of their essential shape properties is to sample from the model he gets. The results are shown in fig 2.13.
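The landmark construction can be sketched as follows (with synthetic data standing in for real face landmarks; all names and numbers are made up): each shape becomes a point of R²ᵏ, a Gaussian is fit via the sample mean and covariance, and PCA extracts the main mode of variation.

```python
import random

def pca_on_landmarks(shapes):
    """shapes: list of landmark vectors in R^(2k), as (x1, y1, ..., xk, yk).
    Returns (mean, top eigenvalue, top eigenvector) via power iteration."""
    m, d = len(shapes), len(shapes[0])
    mean = [sum(s[j] for s in shapes) / m for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in shapes]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (m - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration for the leading principal component.
    v = [1.0] * d
    for _ in range(200):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return mean, lam, v

# Synthetic "landmarks": 2 points per shape, varying mostly in x of point 1.
rng = random.Random(0)
shapes = [[rng.gauss(0, 1.0), rng.gauss(0, 0.1),
           rng.gauss(5, 0.1), rng.gauss(0, 0.1)] for _ in range(200)]
mean, lam, v = pca_on_landmarks(shapes)
print(lam, v)  # the leading mode points essentially along the first coordinate
```

Sampling along the leading eigenvectors reproduces the main modes of variation; the high-kurtosis objection in the text is precisely that real landmark clouds are not well described by one such Gaussian.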
It seems to me that these models are getting much closer to the sort of special-purpose prior that is needed in object-recognition programs. Whether his models have continuum limits and of what sort is an open question. There are really three goals for a theory of shapes adapted to the analysis of images. The ﬁrst is to understand better the global geometry of S and which metrics are appropriate in which vision applications. The second is to create the best general-purpose priors on this space, which can apply to arbitrary shapes. The third is to mold special-purpose priors to all types of shapes which are encountered frequently, to express their speciﬁc variability. Some progress has been made on all three of these but much is left to be done.

2.4 Summary

Solving the problem of vision requires solving three subproblems: finding the right classes of stochastic models to express accurately the variability of visual patterns in nature, finding ways to learn the details of these models from data, and finding ways to reason rapidly using Bayesian inference on these models. This chapter has

NOTES

49

Six “animals” that never existed: they are random samples from the prior of S. C. Zhu trained on real animal silhouettes. The interior lines come from his use of medial axis techniques to generate the shapes.

Figure 2.13

addressed the ﬁrst. Here a great deal of progress has been made but it must be said that much remains to be done. My own belief is that good theories of groupings are the biggest gap. Although not discussed in this article, let me add that great progress has been made on the second and third problems with a large number of ideas, e.g., the expectation maximization (EM) algorithm, much faster Monte Carlo algorithms, maximum entropy (MaxEnt) methods to ﬁt exponential models, Bayesian belief propagation, particle ﬁltering, and graph-theoretic techniques.

Notes

1. The chief mistake of the Greeks was their persistent belief that the eye must emit some sort of ray in order to do something equivalent to touching the visible surfaces.
2. This is certainly biologically unrealistic. Life requires rapid analysis of changing scenes. But this article, like much of vision research, simplifies its analysis by ignoring time.
3. It is the second idea that helps to explain why aerial photographs also show approximate scale invariance.
4. The infrared divergence is readily solved by considering images mod constants. If the pixel values are the log of the photon energy, this constant is an irrelevant gain factor.
5. Some have found an especially large concentration near 1/f^1.8 or 1/f^1.9, especially for forest scenes (Ruderman and Bialek).
6. Scale invariance implies that the expected power at spatial frequency (ξ, η) is a constant times 1/(ξ^2 + η^2), and integrating this over (ξ, η) gives ∞.
7. A set S of points is open if S contains a small disk of points around each point x ∈ S. "Smooth" means that the curve is locally the graph of a function with infinitely many derivatives; in many applications, one may want to include shapes with corners. We simplify the discussion here and assume there are no corners.
8. Charpiat et al., however, have used p-norms for large p in order to "tame" L∞ norms.
9. This is a key consideration when seeking means of clusters of finite sets of shapes and when seeking principal components of such clusters.

3

The Machine Cocktail Party Problem

Simon Haykin and Zhe Chen


Imagine that you are in a cocktail party environment with background music, participating in a conversation with one or more of your friends. Despite the noisy background, you are able to converse with your friends, switching from one to another with relative ease. Is it possible to build an intelligent machine that can perform as you do in such a noisy environment? This chapter explores that possibility. The cocktail party problem (CPP), first posed by Colin Cherry, refers to the remarkable human ability to selectively attend to and recognize one source of auditory input in a noisy environment, where the interference is produced by competing speech sounds or various noise sources, all of which are usually assumed to be independent of one another (Cherry, 1953). Following the early pioneering work (Cherry, 1953, 1957, 1961; Cherry and Taylor, 1954), numerous efforts have been dedicated to the CPP in diverse fields: physiology, neurobiology, psychophysiology, cognitive psychology, biophysics, computer science, and engineering.1 Over half a century after Cherry's seminal work, however, it is fair to say that a complete understanding of the cocktail party phenomenon is still missing; the marvelous auditory perceptual capability of human beings remains enigmatic. To unveil this mystery and thereby imitate human performance by means of a machine, computational neuroscientists, computer scientists, and engineers have attempted to view, and simplify, this complex perceptual task as a learning problem for which a tractable computational solution is sought. An important lesson learned from their collective work is that in order to imitate the human capacity for audition, a deep understanding of the human auditory system is crucial.
This does not mean that we must duplicate every aspect of the human auditory system in order to solve the machine cocktail party problem, hereafter referred to as the machine CPP. Rather, the challenge is to expand on what we know about the human auditory system and put it to practical use by exploiting advanced computing and signal-processing technologies (e.g., microphone arrays, parallel computers, and VLSI chips). An efficient and effective solution to the machine CPP will not only be a major accomplishment in its own right; it will also have a direct impact on ongoing research in artificial intelligence (such as robotics) and human-machine interfaces (such as hearing aids), and these lines of research will, in their own individual ways, further deepen our understanding of the human brain.

There are three fundamental questions pertaining to the CPP:

1. What is the cocktail party problem?
2. How does the brain solve it?
3. Is it possible to build a machine capable of solving it in a satisfactory manner?

The first two questions are human oriented, involving mainly the disciplines of neuroscience, cognitive psychology, and psychoacoustics; the last is rooted in machine learning, which involves the computer science and engineering disciplines. While the three issues are equally important, this chapter will focus on the third question by addressing a solution to the machine CPP. To understand the CPP, we may identify three underlying neural processes:2

Analysis: The analysis process mainly involves segmentation or segregation, that is, the segmentation of an incoming auditory signal into individual channels or streams. Among the heuristics a listener uses to do the segmentation, spatial location is perhaps the most important: sounds coming from the same location are grouped together, while sounds originating from different directions are segregated.

Recognition: The recognition process involves analyzing the statistical structure of the essential patterns contained in a sound stream. The goal of this process is to uncover the neurobiological mechanisms through which humans are able to identify a segregated sound from multiple streams with relative ease.

Synthesis: The synthesis process involves reconstructing the individual sound waveforms from the separated sound streams. While synthesis is an important process carried out in the brain, it is also the problem of primary interest to the machine CPP.
From an engineering viewpoint, we may, in a loose sense, regard synthesis as the inverse of the combination of analysis and recognition in that synthesis attempts to uncover relevant attributes of the speech production mechanism. Note also that, insofar as the machine CPP is concerned, an accurate synthesis does not necessarily mean having solved the analysis and recognition problems, although additional information on these two problems might provide more hints for the synthesis process. Bearing in mind that the goal of solving the machine CPP is to build an intelligent machine that can operate eﬃciently and eﬀectively in a noisy cocktail party environment, we propose a computational framework for active audition that has the potential to serve this purpose. To pave the way for describing this framework, we will discuss the important aspects of human auditory scene analysis and computational auditory scene analysis. Before proceeding to do so, however, some historical notes on the CPP are in order.

3.1  Some Historical Notes

In the historical notes that follow, we do two things. First, we present highlights of the pioneering experiments performed by Colin Cherry over half a century ago, which are as valid today as they were then; along the way, we also refer to other related work. Second, we highlight three machine-learning approaches that have been motivated by the CPP in one form or another: independent component analysis, oscillatory correlation, and cortronic processing.

3.1.1  Cherry's Early Experiments

In the early 1950s, Cherry became interested in the remarkable hearing capability of human beings in a cocktail party environment. He raised several questions: What is our selective attention ability? How are we able to select information coming from multiple sources? How much information is retained even when we pay no attention to it? To answer these fundamental questions, Cherry (1953) compared the ability of listeners to attend to two different spoken messages under different scenarios. In one set-up, the two recorded messages were mixed and presented together over headphones; in the classic dichotic listening set-up, a different message was delivered to each ear. The listeners were asked to test the intelligibility3 of a message and to repeat each of its words as it was heard, a task referred to as shadowing. In the cited paper, Cherry reported that when one message is delivered to one ear (the attended channel) and a different message is delivered to the other ear (the unattended channel), listeners can easily attend to one or the other of the two messages, determining almost all of the information in the attended message while recalling very little about the unattended one. It was also found that the listeners became quite good at the shadowing task after a few minutes, repeating the attended speech quite accurately. However, after a few minutes of shadowing, listeners had no idea what the unattended voice was about, or even whether it was English that was spoken. Based on these and other observations, Cherry conjectured that some sort of spatial filtering of the concurrently occurring sounds and voices might help in attending to one message.
It is noteworthy that Cherry (1953) also suggested some procedures for designing a "filter" (machine) to solve the CPP, accounting for the following: (1) the voices coming from different directions; (2) lip reading, gesture, and the like; (3) different speaking voices, mean pitches, mean speeds, male vs. female, and so forth; (4) different accents and linguistic factors; and (5) transition probabilities (based on subject matter, voice dynamics, syntax, etc.). In addition, Cherry speculated that humans have a vast memory of transition probabilities that makes the task of hearing much easier by allowing prediction of word sequences. The main findings of the dichotic listening experiments conducted by Cherry and others are that, in general, it is difficult to attend to two sound sources at once, and that when we switch attention to an unattended source (e.g., by listening for a spoken name), we may lose information from the attended source. Indeed, common experience teaches us that when we attempt to tackle more than one task at a time, we may end up sacrificing performance. In subsequent joint investigations with colleagues (Cherry, 1961; Cherry and Sayers, 1956, 1959; Sayers and Cherry, 1957), Cherry also studied the binaural fusion mechanism and proposed a cross-correlation-based technique for measuring certain parameters of speech intelligibility. Basically, it was hypothesized that the brain performs correlation on the signals received by the two ears, playing the roles of localization and coincidence detection. In the binaural fusion studies, Sayers and Cherry (1957) showed that the human brain does indeed execute short-term correlation analysis for both monaural and binaural listening. To sum up, Cherry not only coined the term "cocktail party problem"; he was also the first experimentalist to investigate the benefits of binaural hearing, to point to the potential of lip-reading and related cues for improved hearing, and to emphasize the critical role of correlation in binaural fusion. Cherry was indeed a pioneer of human communication.

3.1.2  Independent Component Analysis

The development of independent component analysis (ICA) was partially motivated by a desire to solve a cocktail party problem. The essence of ICA can be stated as follows: given an instantaneous linear mixture of signals produced by a set of sources, devise an algorithm that exploits a statistical discriminant to differentiate these sources so as to separate the source signals in a blind manner (Bell and Sejnowski, 1995; Comon, 1994; Jutten and Herault, 1991). The key question is: how? To address this question, we first recognize that if we are to achieve blind separation of an instantaneous linear mixture of independent source signals, then there must be a characteristic departure from the simplest possible source model, an independently and identically distributed (i.i.d.) Gaussian model; any such departure gives rise to a more structured source model that can be exploited. The departure can arise in three different ways, depending on which of the three characteristic assumptions embodied in this simple source model is broken, as summarized here (Cardoso, 2001):

Non-Gaussian i.i.d. model: In this route to blind source separation, the i.i.d. assumption for the source signals is retained, but the Gaussian assumption is abandoned for all the sources, except possibly one of them. The Infomax algorithm due to Bell and Sejnowski (1995), the natural-gradient algorithm due to Amari et al. (1996), Cardoso's JADE algorithm (Cardoso, 1998; Cardoso and Souloumiac, 1993), and the FastICA algorithm due to Hyvärinen and Oja (1997) are all based on the non-Gaussian i.i.d. model; they differ from each other in the way in which source information residing in higher-order statistics (HOS) is exploited.


Gaussian nonstationary model: In this second route to blind source separation, the Gaussian assumption is retained for all the sources, which means that second-order statistics (i.e., mean and variance) are sufficient for characterizing each source signal. Blind source separation is achieved by exploiting the property of nonstationarity, provided that the source signals differ from each other in the ways their statistics vary with time. This approach was first described by Parra and Spence (2000) and Pham and Cardoso (2001). Whereas the algorithms based on the non-Gaussian i.i.d. model operate in the time domain, the algorithms belonging to the Gaussian nonstationary model operate in the frequency domain, a feature that also makes it possible for this second class of ICA algorithms to work with convolutive mixtures.

Gaussian, stationary, correlated-in-time model: In this third and final route to blind source separation, the blind separation of Gaussian stationary source signals is achieved on the proviso that their power spectra are not proportional to each other. Recognizing that the power spectrum of a wide-sense stationary random process is related to the autocorrelation function via the Wiener-Khintchine theorem, spectral differences among the source signals translate to corresponding differences in their correlated-in-time behavior, and it is this latter property that is available for exploitation.

To sum up, Comon's 1994 paper and the 1995 paper by Bell and Sejnowski have been the catalysts for the literature on ICA theory, algorithms, and novel applications. Indeed, the literature is so extensive and diverse that in the course of ten years ICA has established itself as an indispensable part of the ever-expanding discipline of statistical signal processing, and it has had a great impact on neuroscience (Brown et al., 2001).
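As a concrete, deliberately toy illustration of the non-Gaussian i.i.d. route, the sketch below separates two synthetic non-Gaussian sources from an instantaneous linear mixture using a bare-bones FastICA iteration with the tanh nonlinearity. The sources, mixing matrix, and all constants here are hypothetical stand-ins, not settings from the published algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-Gaussian sources standing in for independent speakers:
# a sawtooth (sub-Gaussian) and Laplacian noise (super-Gaussian).
n = 20000
t = np.arange(n)
s1 = 2.0 * ((t / 97.0) % 1.0) - 1.0
s2 = rng.laplace(size=n)
S = np.vstack([s1, s2])

A = np.array([[1.0, 0.6], [0.5, 1.0]])        # unknown mixing matrix
X = A @ S                                     # observed "microphone" signals

# Whiten the mixtures (zero mean, identity covariance).
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n)
Xw = E @ np.diag(d ** -0.5) @ E.T @ X

# FastICA fixed-point iteration with g = tanh, deflation over components.
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    for _ in range(200):
        wx = w @ Xw
        g, gp = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        w_new = (Xw * g).mean(axis=1) - gp.mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)    # decorrelate from earlier rows
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1.0) < 1e-10
        w = w_new
        if converged:
            break
    W[i] = w

Y = W @ Xw                                    # recovered sources (up to sign/order)

# Each recovered component should correlate strongly with exactly one source.
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
print(corr.round(3))
```

The sign and ordering ambiguity visible in the final check is intrinsic to blind separation: ICA can recover the sources only up to permutation and scaling.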
On technical grounds, however, Haykin and Chen (2005) justify the statement that ICA does not solve the cocktail party problem; rather, it addresses a blind source separation (BSS) problem.

3.1.3  Temporal Binding and Oscillatory Correlation

Temporal binding theory was most elegantly illustrated by von der Malsburg (1981) in his seminal technical report "Correlation Theory of Brain Function," in which he made two important observations: (1) the binding mechanism is accomplished by virtue of correlation between presynaptic and postsynaptic activities, and (2) the strengths of the synapses follow Hebb's postulate of learning. When the synchrony between the presynaptic and postsynaptic activities is strong (weak), the synaptic strength correspondingly increases (decreases) for a time. Moreover, von der Malsburg suggested a dynamic link architecture to solve the temporal binding problem by letting neural signals fluctuate in time and by synchronizing those sets of neurons that are to be bound together into a higher-level symbol or concept. Using the same idea, von der Malsburg and Schneider (1986) proposed a solution to the cocktail party problem. In particular, they developed a neural cocktail-party processor that uses synchronization and desynchronization to segment the incoming sensory inputs. Though based on simple experiments (in which von der Malsburg and Schneider used amplitude modulation and stimulus-onset synchrony as the main acoustic cues, in line with Helmholtz's suggestion), the underlying idea is illuminating in that the model is consistent with anatomic and physiologic observations. The original idea of von der Malsburg was subsequently extended to different sensory domains, whereby the phases of neural oscillators are used to encode the binding of sensory components (Brown and Wang, 1997; Wang et al., 1990). Of particular interest is the two-layer oscillator model due to Wang and Brown (1999), whose aim is to achieve "searchlight attention" by examining the temporal cross-correlation between the activities of pairs (or populations) of neurons. The first layer, the segmentation layer, acts as a locally excitatory, globally inhibitory oscillator network; the second layer, the grouping layer, essentially performs computational auditory scene analysis (CASA). Preceding the oscillator network, there is an auditory periphery model (cochlea and hair cells) as well as a mid-level auditory representation stage (the correlogram). As reported by Wang and Brown (1999), the model is capable of segregating a mixture of voiced speech and different interfering sounds, thereby improving the signal-to-noise ratio (SNR) of the attended speech signal. The correlated neural oscillator is arguably biologically plausible; unlike ICA algorithms, however, the performance of the neural oscillator model appears to deteriorate significantly in the presence of multiple competing sources.
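The synchronization-based grouping idea can be caricatured with a Kuramoto-style phase model: oscillators belonging to the same putative stream are coupled and phase-lock, while oscillators in different streams drift independently. This is only a schematic sketch of oscillatory correlation, not the Wang-Brown model; the frequencies, coupling constant, and group sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two "streams" of 5 oscillators each, with distinct natural frequencies.
n_per_group, K, dt, steps = 5, 2.0, 0.01, 4000
omega = np.concatenate([rng.normal(10.0, 0.1, n_per_group),   # stream A
                        rng.normal(17.0, 0.1, n_per_group)])  # stream B
theta = rng.uniform(0, 2 * np.pi, 2 * n_per_group)

# Coupling matrix: all-to-all within a stream, zero across streams.
m = 2 * n_per_group
C = np.zeros((m, m))
C[:n_per_group, :n_per_group] = K / n_per_group
C[n_per_group:, n_per_group:] = K / n_per_group
np.fill_diagonal(C, 0.0)

# Euler-integrate d(theta_i)/dt = omega_i + sum_j C_ij sin(theta_j - theta_i).
for _ in range(steps):
    theta += dt * (omega + (C * np.sin(theta[None, :] - theta[:, None])).sum(axis=1))

# Phase coherence |mean exp(i*theta)| per group: near 1 means synchronized.
coh = lambda ph: np.abs(np.exp(1j * ph).mean())
print(f"stream A coherence: {coh(theta[:n_per_group]):.3f}, "
      f"stream B coherence: {coh(theta[n_per_group:]):.3f}")
```

Within each coupled group the phases lock tightly, while the two groups stay mutually desynchronized: the phase pattern itself, rather than any labeled channel, encodes which oscillators belong to which stream.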

3.1.4  Cortronic Processing

The idea of the so-called cortronic network was motivated by the fact that the human brain employs an efficient sparse coding scheme to extract the features of sensory inputs and accesses them through associative memory (Hecht-Nielsen, 1998). Specifically, in Sagi et al. (2001) the CPP is viewed as an aspect of the human speech-recognition problem in a cocktail party environment, and the solution is cast as an attended-source identification problem. In the experiments reported there, only one microphone was used to record the auditory scene, but the listener was assumed to be familiar with the language of the conversation under study; moreover, all the subjects were chosen to speak the same language and to have similar voice qualities. The goal of the cortronic network is to identify the one attended speech signal of interest, which is the essence of the CPP. According to the experimental results reported by Sagi et al. (2001), the cortronic network appears to be quite robust with respect to variations in speech, speaker, and noise, even at a −8 dB SNR (with a single microphone). Compared with other computational approaches proposed to solve the CPP, the cortronic network distinguishes itself by exploiting prior knowledge pertaining to speech and the spoken language; it also implicitly confirms the validity of Cherry's early speculation about the use of memory (recall section 3.1.1).

3.2  Human Auditory Scene Analysis: An Overview

Human auditory scene analysis (ASA) is a general process carried out by the auditory system of a human listener for the purpose of extracting information pertaining to a sound source of interest that is embedded in a background of noise or interference. The auditory system is made up of the two ears (constituting the organs of hearing) and the auditory pathways. In more specific terms, it is a sophisticated information-processing system that enables us not only to detect the frequency composition of an incoming sound wave but also to locate the sound sources (Kandel et al., 2000). This is all the more remarkable given that the energy in incoming sound waves is exceedingly small and the frequency composition of most sounds is rather complicated.

3.2.1  Where and What

The mechanisms of auditory perception essentially involve two processes: sound localization ("where") and sound recognition ("what"). It is well known that, for localizing sound sources in the azimuthal plane, the interaural time difference (ITD) is the main acoustic cue at low frequencies and for complex stimuli with low-frequency repetition, while the interaural level difference is the main cue at high frequencies (Blauert, 1983; Yost, 2000; Yost and Gourevitch, 1987). Spectral differences introduced by the head-related transfer function (HRTF) are the main cues used for vertical localization. Loudness (intensity) and early reflections are possible cues for localization as a function of distance. In hearing, the precedence effect refers to the phenomenon, occurring during auditory fusion, whereby two sounds of the same order of magnitude presented dichotically are localized toward the ear receiving the first sound stimulus (Yost, 2000); the precedence effect stresses the importance of the first wavefront in determining the sound location. The "what" question mainly addresses the processes of sound segregation (streaming) and sound determination (identification). While it plays a critical role in sound localization, spatial separation is not considered a strong acoustic cue for streaming or segregation (Bregman, 1990). According to Bregman's studies, sound segregation is a two-stage process of feature selection and feature grouping. Feature selection processes the auditory stimuli into a collection of favorable (e.g., frequency-sensitive, pitch-related, temporal-spectral) features. Feature grouping, on the other hand, is responsible for combining similar elements of the incoming sounds, according to certain principles, into one or more coherent streams, with each stream corresponding to one informative sound source.
Sound determination is more speciﬁc than segregation in that it not only involves segmentation of the incoming sound into diﬀerent streams, but also identiﬁes the content of the sound source in question.
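Cherry's correlation hypothesis, taken up again below, is easy to demonstrate numerically: given left- and right-ear signals that differ by a pure interaural delay, the lag maximizing their cross-correlation recovers the ITD. Everything in this sketch (sample rate, delay, noise level) is an arbitrary synthetic choice, not data from the psychoacoustic literature.

```python
import numpy as np

rng = np.random.default_rng(1)

fs = 44100                        # sample rate (Hz)
n = 4096
source = rng.normal(size=n)       # broadband source signal

# Hypothetical interaural delay: the sound reaches the right ear
# 20 samples (~0.45 ms) after the left ear.
true_itd = 20
left = source
right = np.concatenate([np.zeros(true_itd), source[:-true_itd]])
right = right + 0.05 * rng.normal(size=n)   # a little independent noise

# Estimate the ITD as the lag maximizing the interaural cross-correlation,
# the operation Cherry and Sayers hypothesized the brain performs.
max_lag = 50
lags = np.arange(-max_lag, max_lag + 1)
xcorr = np.array([np.sum(left[max(0, -k):n - max(0, k)] *
                         right[max(0, k):n - max(0, -k)]) for k in lags])
est_itd = lags[np.argmax(xcorr)]
print(f"estimated ITD: {est_itd} samples ({est_itd / fs * 1e3:.2f} ms)")
```

Mapping the recovered delay to an azimuth would additionally require a head model; the point here is only that a simple correlator suffices to extract the "where" cue.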

3.2.2  Spatial Hearing

From a communication perspective, our two outer ears act as receiving antennas for acoustic signals from a speaker or audio source. In the presence of one or more competing or masking sound sources, the human ability to detect and understand the source of interest (i.e., the target) is degraded. However, the influence of the masking sources generally decreases when the target and the maskers are spatially separated, compared with when they are in the same location; this effect is credited to spatial hearing (filtering). As pointed out in section 3.1.1, Cherry (1953) suggested that spatial hearing plays a major role in the auditory system's ability to separate sound sources in a multiple-source acoustic environment. Many subsequent experiments have verified Cherry's conjecture. Specifically, directional hearing (Yost and Gourevitch, 1987) is crucial for suppressing interference and enhancing speech intelligibility (Bronkhorst, 2000; Hawley et al., 1999). Spatial separation of the sound sources is also believed to be more beneficial to localization than to segregation (Bregman, 1990). The classic book by Blauert (1983) presents a comprehensive treatment of the psychophysical aspects of human sound localization. Given multiple sound sources in an enclosed space (such as a conference room), spatial hearing helps the brain take full advantage of the slight differences (e.g., in timing and intensity) between the signals that reach the two outer ears. On this basis, the brain performs monaural (autocorrelation) and binaural (cross-correlation) processing for specific tasks (such as coincidence detection, precedence detection, localization, and fusion), from which auditory events are identified, followed by higher-level auditory processing (i.e., attention, streaming, and cognition). Fig. 3.1 provides a functional diagram of the binaural spatial hearing process.

3.2.3  Binaural Processing

One of the key observations derived from Cherry's classic experiment described in section 3.1.1 is that it is easier to separate sources heard binaurally than monaurally. Quoting from Cherry and Taylor (1954): "One of the most striking facts about our ears is that we have two of them—and yet we hear one acoustic world; only one voice per speaker." We believe that nature gives us two ears for a reason, just as it gives us two eyes. It is binocular vision (stereovision) and binaural hearing (stereausis) that enable us to perceive the dynamic world and that provide our main sources of sensory information. Binocular and binaural processing are considered crucial in certain perceptual activities (e.g., binocular/binaural fusion, depth perception, localization). Given one sound source, the two ears receive slightly different sound patterns because of the finite delay produced by their physically separated locations. The brain is known to be extremely efficient in extracting and then using different acoustic cues (to be discussed in detail later) to perform specific audition tasks.

Figure 3.1  A functional diagram of binaural hearing, which consists of physical, psychophysical, and psychological aspects of auditory perception. The diagram's stages include autocorrelation, cross-correlation, coincidence detection, localization, fusion, segregation, attention, and cognition, linked by bottom-up (signal-driven) and top-down (expectation-driven) paths, with visual cues also feeding the formation of the auditory event. Adapted from Blauert (1983) with permission.

An influential binaural phenomenon is so-called binaural masking (e.g., Moore, 1997; Yost, 2000). The threshold for detecting a signal masked in noise can sometimes be lower when listening with two ears than with one, as demonstrated by the binaural masking level difference (BMLD). It is known (Yost, 2000) that the masked threshold of a signal is the same when the stimuli are presented in a monotic or a diotic condition; when the masker and the signal are presented in a dichotic condition, the signal has a lower threshold than in either of the other two conditions. Similarly, many experiments have verified that binaural hearing increases speech intelligibility when the speech signal and the noise are presented dichotically. Another important binaural phenomenon is binaural fusion, which is the essence of directional hearing. As pointed out in section 3.1.1, the fusion mechanism is naturally modeled as performing some kind of correlation (Cherry, 1961; Cherry and Sayers, 1956), for which a binaural fusion model based on the autocorrelogram and cross-correlogram was proposed (as illustrated in fig. 3.1).

3.3  Computational Auditory Scene Analysis

In contrast to human auditory scene analysis (ASA), computational auditory scene analysis (CASA) relies on the development of a computational model of the auditory scene, with one of two goals in mind, depending on the application of interest:

The design of an intelligent machine that, by itself, is able to automatically extract and track a sound signal of interest in a cocktail party environment.

The design of an adaptive hearing system that supplies the perceptual grouping process missing from the auditory system of a hearing-impaired individual, thereby enabling that individual to attend to a sound signal of interest in a cocktail party environment.

Naturally, CASA is motivated by, and builds on, our understanding of human auditory scene analysis or, even more generally, of human cognitive behavior. Following the seminal work of Bregman (1990), many researchers (see, e.g., Brown and Cooke, 1994; Cooke, 1993; Cooke and Ellis, 2001; Rosenthal and Okuno, 1998) have pursued CASA in different ways. Representative approaches include the data-driven scheme (Cooke, 1993) and the prediction-driven scheme (Ellis, 1996). The feature common to the two schemes is the integration of low-level (bottom-up, primitive) acoustic cues for potential grouping. Their main differences are these: data-driven CASA aims to decompose the auditory scene into time-frequency elements ("strands") and then run the grouping procedure; prediction-driven CASA, on the other hand, views prediction as the primary goal and requires only a world model that is consistent with the stimulus; it integrates top-down and bottom-up cues and can deal with incomplete or masked data (i.e., a speech signal with missing information).
However, as emphasized by Bregman (1996), it is important for CASA modelers to take into account psychological data as well as the way humans carry out ASA: namely, modeling the stability of human ASA, making it possible for different cues to cooperate and compete, and accounting for the propagation of constraints across the frequency-by-time field.

3.3.1  Acoustic Cues

The psychophysical attributes of sound mainly involve three forms of information: spatial location, temporal structure, and spectral characterization. The perception of a sound signal in a cocktail party environment is uniquely determined by this collective information; a difference in any of the three forms is believed to be sufficient to discriminate two sound sources. In sound perception, many acoustic features (cues) are used to perform specific tasks. Table 3.1 summarizes some visual/acoustic features (i.e., spatial, temporal, or spectral patterns) used for single-stream sound perception.

Table 3.1  The Features and Cues Used in Sound Perception

    Feature/Cue             Domain               Task
    Visual                  Spatial              "Where"
    ITD                     Spatial              "Where"
    IID                     Spatial              "Where"
    Intensity, loudness     Temporal             "Where" + "What"
    Periodicity             Temporal             "What"
    Onsets                  Temporal             "What"
    AM                      Temporal             "What"
    FM                      Temporal-spectral    "What"
    Pitch                   Spectral             "What"
    Timbre, tone            Spectral             "What"
    Harmonicity, formant    Spectral             "What"

A brief description of some important acoustic cues listed in table 3.1 is in order:

Interaural time difference (ITD): A measure of the difference between the time at which sound waves reach the left ear and the time at which the same waves reach the right ear.

Interaural intensity difference (IID): A measure of the difference in intensity of the sound waves reaching the two ears, due to head shadow.

Amplitude modulation (AM): A method of sound-signal transmission whereby the amplitude of some carrier frequency is modified in accordance with the sound signal.

Frequency modulation (FM): Another method of modulation, in which the instantaneous frequency of the carrier is varied with the frequency of the sound signal.

Onset: A sudden increase in the energy of a sound signal; each discrete event in the sound signal has an onset.

Pitch: A property of auditory sensation in which sounds are ordered on a musical scale; pitch bears a relationship to frequency similar to that of loudness to intensity.

Timbre: The attribute of auditory sensation by which a listener is able to discriminate between two sound signals of similar loudness and pitch but different tonal quality; timbre depends primarily on the spectrum of a sound signal.

A combination of several of these acoustic cues is the key to performing CASA. Psychophysical evidence also suggests that useful cues may be provided by spectral-temporal correlations (Feng and Ratnam, 2000).

3.3.2  Feature Binding

One other important function involved in CASA is that of feature binding, which refers to the problem of representing conjunctions of features. According to von der Malsburg (1999), binding is a general process that applies to all types of knowledge representations, extending from the most basic perceptual representation to the most complex cognitive representation. Feature binding may be either static or dynamic. Static feature binding involves a representational unit that stands for a specific conjunction of properties, whereas dynamic feature binding represents conjunctions of properties as the binding of units in the representation of an auditory scene. The most popular dynamic binding mechanism is based on temporal synchrony, hence the reference to it as "temporal binding"; this form of binding was discussed in section 3.1.3. König et al. (1996) have suggested that the synchronous firing of neurons plays an important role in information processing within the cortex. Rather than being temporal integrators, cortical neurons might serve the purpose of coincidence detectors, evidence for which has been addressed by many researchers (König and Engel, 1995; Schultz et al., 2000; Singer, 1993). Dynamic binding is closely related to the attention mechanism, which is used to control the synchronized activities of different assemblies of units and how the finite binding resource is allocated among neuronal assemblies (Singer, 1993, 1995). Experimental evidence has shown that synchronized firing tends to provide the attended stimulus with an enhanced representation.

3.3.3 Dereverberation

For auditory scene analysis, studying the effect of room acoustics on the cocktail party environment is important (Blauert, 1983; MacLean, 1959). A conversation in a closed room often suffers from the multipath effect: echoes and reverberation, which are almost ubiquitous but rarely consciously noticed. Depending on the acoustics of the room, a reflection from a surface (e.g., a wall or the floor) produces reverberation. In the time domain, the reflection manifests itself as smaller, delayed replicas (echoes) added to the original sound; in the frequency domain, the reflection introduces a comb-filter effect into the frequency response. When the room is large, echoes can sometimes be heard consciously. The human auditory system is powerful enough to exploit binaural and spatial hearing to suppress echoes efficiently, thereby improving hearing performance. For a machine CPP, however, the design would have to include specific dereverberation (or deconvolution) algorithms to overcome this effect. Those acoustic cues listed in table 3.1 that are spatially dependent, such as ITD and IID, are naturally affected by reverberation. On the other hand, acoustic cues that are space invariant, such as common onset across frequencies and pitch, are less sensitive to reverberation. On this basis, we may say that an intelligent machine should have the ability to adaptively weight the spatially dependent acoustic cues (prior to their fusion) so as to deal with a reverberant environment in an effective manner.
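The comb-filter effect of a single reflection can be made concrete with a small numerical sketch in Python/NumPy (the attenuation and delay values here are illustrative, not drawn from the text):

```python
import numpy as np

# A single reflection with attenuation alpha and delay D samples gives the
# impulse response h[n] = delta[n] + alpha * delta[n - D].  Its magnitude
# response |H(f)| = sqrt(1 + alpha^2 + 2*alpha*cos(2*pi*f*D)) shows regularly
# spaced peaks and notches: the comb-filter effect noted above.

def comb_magnitude(alpha, delay, n_fft=512):
    """Magnitude response of a direct path plus one attenuated echo."""
    h = np.zeros(n_fft)
    h[0] = 1.0          # direct path
    h[delay] = alpha    # single echo
    return np.abs(np.fft.rfft(h))

mag = comb_magnitude(alpha=0.8, delay=32)
# Peaks of height (1 + alpha) occur where the echo adds in phase,
# notches of depth (1 - alpha) where it cancels.
print(mag.max(), mag.min())
```

With alpha = 0.8 the response swings between 1.8 and 0.2, a ripple of about 19 dB across frequency, which is exactly the kind of spectral distortion a dereverberation algorithm must undo.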

3.4 Insights from Computational Vision

It is well known to neuroscientists that audition (hearing) and vision (seeing) share substantial common features in their sensory processing principles as well as in the anatomic and functional organizations of higher-level centers in the cortex. With the design of an effective and efficient machine CPP as our goal, it is therefore highly informative to derive insights from the extensive literature on computational vision. We do so in this section by first looking at Marr's classic theory of vision.

3.4.1 Marr's Vision Theory and Its Insights for Auditory Scene Analysis

In his landmark book, David Marr presented three levels of analysis of information-processing systems (Marr, 1982):

Computation: What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?

Representation: How can this computational theory be implemented? In particular, what is the representation for the input and output, and what is the algorithm for the transformation?

Implementation: How can the representation and the algorithm be realized physically?


From many perspectives, Marr's observations highlight the fundamental questions that need to be addressed in computational neuroscience, not only in the context of vision but also in that of audition. As a matter of fact, Marr's theory has provided many insights into auditory research (Bregman, 1990; Rosenthal and Okuno, 1998). In a similar vein to visual scene analysis (e.g., Julesz and Hirsh, 1972), auditory scene analysis (Bregman, 1990) attempts to identify the content (what) and the location (where) of the sounds/speech in an auditory environment. In specific terms, auditory scene analysis consists of two stages. In the first stage, the segmentation process decomposes a complex acoustic scene into a collection of distinct sensory elements; in the second stage, the grouping process combines these elements into a stream according to some principles. Subsequently, the streams are interpreted by a higher-level process for recognition and scene understanding. Motivated by Gestalt psychology, Bregman (1990) has proposed five grouping principles for ASA:

Proximity: Characterizes the distances between auditory cues (features) with respect to their onsets, pitch, and intensity (loudness).

Similarity: Usually depends on the properties of a sound signal, such as timbre.

Continuity: Features the smoothly varying spectrum of a sound signal.

Closure: Completes fragmentary features that have a good gestalt; the completion may be viewed as a form of auditory compensation for masking.

Common fate: Groups together activities (e.g., common onsets) that are synchronous.

Moreover, Bregman (1990) has distinguished at least two levels of auditory organization: primitive streaming and schema-based segregation, with schemas being provided by phonetic, prosodic, syntactic, and semantic forms of information. While being applicable to general sound scene analysis involving speech and music, Bregman's work has focused mainly on primitive stream segregation.

3.4.2 A Tale of Two Sides: Visual and Auditory Perception

Visual perception and auditory perception share many common features in terms of sensory processing principles. According to Shamma (2001), these common features include the following:

Lateral inhibition for edge/peak enhancement: In an auditory task, it aims to extract the profile of the sound spectrum, whereas in the visual system it aims to extract the form of an image.

Multiscale analysis: The auditory system performs cortical spectrotemporal analysis, whereas the visual system performs cortical spatiotemporal form analysis.

Detecting temporal coincidence: This process may serve periodicity pitch perception in an auditory scene, compared to the perception of bilateral symmetry in a visual task.

Detecting spatial coincidence: The same algorithm captures binaural azimuthal localization in the auditory system (stereausis), while it gives rise to binocular depth perception in the visual system.

Besides sharing these common features and processes, the auditory system also benefits from the visual system. For example, it is well known that there exist interactions between different sensory modalities. Neuroanatomy reveals the existence of corticocortical pathways between the auditory and visual cortices. The hierarchical organization of the cortices and the numerous thalamocortical and corticothalamic feedback loops are speculated to stabilize the perceptual object. Daily life experiences also teach us that a visual scene input (e.g., lip reading) influences attention (Jones and Yee, 1996) and benefits speech perception. The McGurk effect (McGurk and MacDonald, 1976) is an auditory-visual speech illusion in which the perception of a speech sound is modified by contradictory visual information. The McGurk effect clearly illustrates the important role played by a visual cue in the comprehension of speech.

3.4.3 Active Vision

In the last paragraph of the introduction, we referred to active audition as having the potential to build an intelligent machine that can operate efficiently and effectively in a noisy cocktail party environment. The proposal to build such a machine has been inspired by two factors: ongoing research on the use of active vision in computational vision, and the sharing of many common sensory principles between visual perception and auditory perception, as discussed earlier. To pave the way for what we have in mind for a framework for active audition, it is in order that we present some highlights of active vision that are of value to the formulation of this framework. First and foremost, it is important to note that the use of an active sensor is not a necessary requirement for active sensing, be it in the context of vision or audition. Rather, a passive sensor (which only receives but does not transmit information-bearing signals) can perform active sensing, provided that the sensor is capable of changing its own state parameters in accordance with a desired sensing strategy. As such, active sensing may be viewed as an application of intelligent control theory, which includes not only control but also reasoning and decision making (Bajcsy, 1988).4 In particular, active sensing embodies the use of feedback in two contexts:

1. The feedback is performed on complex processed sensory data, such as extracted features that may also include relational features.

2. The feedback is dependent on prior knowledge.

Active vision (also referred to as animate vision) is a special form of active sensing, which has been proposed by Bajcsy (1988) and Ballard (1988), among others (e.g., Blake and Yuille, 1992). In active vision, it is argued that vision is best understood in the context of visual behaviors. The key point to note here is that the task of vision is not to build a model of the surrounding real world, as originally postulated in Marr's theory, but rather to use visual information in the service of the real world in real time, and to do so efficiently and inexpensively (Clark and Eliasmith, 2003). In effect, the active vision paradigm gives "action" a starring role (Sporns, 2003). Rao and Ballard (1995) proposed an active vision architecture motivated by biological studies. The architecture is based on the hierarchical decomposition of the visual behavior involved in scene analysis (i.e., relating internal models to external objects). The architecture employs two components: the "what" component, which corresponds to the problem of object identification, and the "where" component, which corresponds to the problem of object localization. These two visual components, or routines, are subserved by two separate memories. The central representation of the architecture is a high-dimensional iconic feature vector comprising the responses of different-order derivatives of Gaussian filters; the purpose of the iconic feature vector is to provide an effective photometric description of local intensity variations in the image region about an object of interest.
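As a toy illustration of such an iconic feature vector, the following Python/NumPy sketch stacks the responses of separable Gaussian derivative filters applied to an image patch. The kernel scale, derivative orders, and test patch are all illustrative choices, not Rao and Ballard's actual parameters:

```python
import numpy as np

# Build a 1-D Gaussian and its first few derivatives, then form separable
# 2-D kernels and collect their responses at a patch as a feature vector.

def gaussian_derivative_kernels(sigma=1.5, radius=4, max_order=2):
    """1-D Gaussian and its first `max_order` (numerical) derivatives."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    kernels = [g]
    for _ in range(max_order):
        kernels.append(np.gradient(kernels[-1], x))  # numerical derivative
    return kernels

def iconic_features(patch, sigma=1.5):
    """Separable filter responses at the patch centre, stacked into a vector."""
    ks = gaussian_derivative_kernels(sigma)
    feats = []
    for kx in ks:
        for ky in ks:
            kernel2d = kx[None, :] * ky[:, None]   # separable 2-D kernel
            feats.append(float((patch * kernel2d).sum()))
    return np.array(feats)

patch = np.outer(np.hanning(9), np.hanning(9))     # toy 9 x 9 image patch
f = iconic_features(patch)
print(f.shape)  # 3 orders x 3 orders = 9 responses
```

In Rao and Ballard's architecture such vectors, computed at multiple scales, serve as the photometric description matched against memory for the "what" and "where" routines.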

3.5 Embodied Intelligent Machines: A Framework for Active Audition

Most of the computational auditory scene analysis (CASA) approaches discussed in the literature share a common assumption: the machine merely listens to the environment but does not interact with it (i.e., the observer is passive). However, as remarked in the previous section, there are many analogies between the mechanisms at work in auditory perception and their counterparts in visual perception. In a similar vein to active vision, active audition is established on the premise that the observer (human or machine) interacts with the environment, and the machine (in a way similar to a human) should therefore conduct perception in an active fashion. According to Varela et al. (1991) and Sporns (2003), embodied cognitive models rely on cognitive processes that emerge from interactions between neural, bodily, and environmental factors. A distinctive feature of these models is that they use "the world as their own model." In particular, embodied cognition has been argued to be the key to the understanding of intelligence (Iida et al., 2004; Pfeifer and Scheier, 1999). The central idea of embodied cognitive machines lies in the observation that "intelligence" becomes meaningless if we exclude ourselves from a real-life scenario; in other words, an intelligent machine is a self-reliant and independent agent capable of adapting itself to a dynamic environment so as to achieve a certain satisfactory goal effectively and efficiently, regardless of the initial setup. Bearing this goal in mind, we may now propose a framework for active audition, which embodies four specific functions: (1) localization and focal attention, (2) segregation, (3) tracking, and (4) learning. In the following, we address these four functions in turn.

3.5.1 Localization and Focal Attention

Sound localization is a fundamental attribute of auditory perception. The task of sound localization can be viewed as a form of binaural depth perception, the counterpart of binocular depth perception in vision. A classic model for sound localization was developed by Jeffress (1948) using binaural cues such as ITD. In particular, Jeffress suggested the use of cross-correlation for calculating the ITD in the auditory system and explained how the model represents the ITD received at the ears; the sound processing and representation in Jeffress's model are simple yet elegant, and arguably neurobiologically plausible. Since the essential goal of localization is to infer the directions of incoming sound signals, this function may be implemented by using an adaptive array of microphones, whose design is based on direction-of-arrival (DOA) estimation algorithms developed in the signal-processing literature (e.g., Van Veen and Buckley, 1988, 1997). Sound localization is often the first step in beamforming, the aim of which is to extract the signal of interest produced in a specific direction. For a robot (or machine) that is self-operating in an open environment, sound localization is essential for subsequent tasks. An essential ingredient in sound localization is time-delay estimation when it is performed in a reverberant room environment. To perform this estimation, many signal-processing techniques have been proposed in the literature:

Generalized cross-correlation (GCC) method (Knapp and Carter, 1976): A simple yet efficient delay-estimation method, implemented in the time domain using maximum-likelihood estimation (a frequency-domain implementation is also possible).

Cross-power spectrum phase (CSP) method (Rabinkin et al., 1996): A delay-estimation method implemented in the frequency domain, which computes the power spectra of the two microphone signals and returns the phase difference between the spectra.

Adaptive eigenvalue decomposition (EVD)-based methods (Benesty, 2000; Doclo and Moonen, 2002): The GCC and CSP methods usually assume an ideal room model without reverberation; hence they may not perform satisfactorily in a highly reverberant environment. To overcome this drawback and enhance robustness, EVD-based methods have been proposed to estimate (implicitly) the acoustic impulse responses using adaptive algorithms that iteratively estimate the eigenvector associated with the smallest eigenvalue. Given the estimated acoustic impulse responses, the time delay can be calculated as the time difference between the main peaks of the two impulse responses or as the peak of the correlation function between the two impulse responses.

Upon locating the sound source of interest, the next step is to focus on the target sound stream and enhance it. Spatial filtering or beamforming techniques (Van Veen and Buckley, 1997) are beneficial for this purpose. Usually, with omnidirectional microphone (array) technology, a machine is capable of picking up most if not all of the sound sources in the auditory scene.
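For concreteness, here is a minimal Python/NumPy sketch of cross-correlation-based time-delay estimation using the phase transform (PHAT) weighting, a widely used robust variant of the GCC family; it is not the exact maximum-likelihood weighting of Knapp and Carter, and the signals and delay are synthetic:

```python
import numpy as np

# GCC-PHAT: whiten the cross-spectrum to unit magnitude so that the inverse
# FFT concentrates into a sharp peak at the true inter-microphone delay.

def gcc_phat_delay(x1, x2, max_delay):
    """Estimate the delay (in samples) of x2 relative to x1 (positive if x2 lags)."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cross /= np.maximum(np.abs(cross), 1e-12)      # PHAT weighting
    cc = np.fft.irfft(cross, n)
    # Keep only lags in [-max_delay, +max_delay].
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(cc)) - max_delay

rng = np.random.default_rng(0)
s = rng.standard_normal(1024)                      # broadband source signal
delay = 7
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))  # x2 is x1 delayed by 7
print(gcc_phat_delay(x1, x2, max_delay=20))
```

In an array, such pairwise delay estimates are what a DOA algorithm converts into a direction, which the beamformer then steers toward.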
However, it is hoped that "smart microphones" may be devised so as to adapt their directivity (i.e., become autodirective) to the attended speaker in real-life conversation scenarios. Hence, designing a beamformer that is robust in a noisy and reverberant environment is crucial for localizing the sound and enhancing the SNR. This adaptivity also naturally brings in the issue of learning, to be discussed in what follows.

3.5.2 Segregation

In this second functional module for active audition, the target sound stream is segregated and the sources of interference are suppressed, thereby focusing attention on the target sound source. This function may be implemented by extracting several acoustic cues (e.g., ITD, IID, onset, and pitch) and then combining them in a fusion algorithm. In order to emulate the human auditory system, a computational strategy for acoustic-cue fusion should dynamically resolve the ambiguities caused by single-cue segregation. The simplest solution is the "winner-take-all" competition, which essentially chooses the cue that has the highest confidence (where the confidence values depend on the specific model used to extract the acoustic cue). When several acoustic cues are in conflict, only the dominant cue will be chosen based on some criterion, such as the weighted-sum mechanism (Woods et al., 1996) that was used for integrating pitch and spatial cues, or the Bayesian framework (Kashino et al., 1998). Recently, Dong (2005) proposed a simple yet effective strategy to solve the multiple-cue fusion problem (see fig. 3.2).

Figure 3.2 A flowchart of the multiple-acoustic-cue fusion process, in which ITD and IID segregation maps are combined by an "AND" operation, followed by pitch and onset segregation and a second "AND" operation (courtesy of Rong Dong).

Basically, the fusion process is performed in a cooperative manner. In the first stage of fusion, given the IID and ITD cues, the time-frequency units are grouped into two streams (a target stream and an interference stream), and the grouping results are represented by two binary maps. These two binary maps are then passed through an "AND" operation to obtain a spatial segregation map, which is further utilized to estimate the pitch of the target signal or the pitch of the interference. Likewise, a binary map is produced from the pitch segregation. If the target is detected as an unvoiced signal, the onset cue is integrated to group the components into separate streams. Finally, all these binary maps are pooled together by a second "AND" operation to yield the final segregation decision. Empirical experiments on this fusion algorithm reported by Dong (2005) have shown very promising results.5
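The two-stage "AND" fusion of binary time-frequency maps can be sketched in a few lines of Python/NumPy; the random maps below merely stand in for the actual outputs of the ITD, IID, pitch, and onset segregation stages, and all names are illustrative:

```python
import numpy as np

# Cooperative fusion of binary segregation maps: a time-frequency unit is
# assigned to the target only if every contributing cue agrees.

def fuse_masks(itd_map, iid_map, pitch_map, onset_map=None):
    """Combine binary segregation maps with successive AND operations."""
    spatial = np.logical_and(itd_map, iid_map)   # first-stage spatial map
    fused = np.logical_and(spatial, pitch_map)   # second-stage fusion
    if onset_map is not None:                    # used when target is unvoiced
        fused = np.logical_and(fused, onset_map)
    return fused

rng = np.random.default_rng(1)
shape = (64, 50)                                 # (frequency channels, frames)
maps = [rng.random(shape) > 0.5 for _ in range(3)]
mask = fuse_masks(*maps)
print(mask.mean())  # fraction of time-frequency units assigned to the target
```

The conjunctive design makes the final mask conservative: each additional cue can only remove, never add, time-frequency units, which is why conflicting single-cue decisions are suppressed rather than amplified.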

3.5.3 Tracking

The theoretical development of sound tracking builds on a state-space model of the auditory environment. The model consists of a process equation that describes the evolution of the state (denoted by xt) at time t, and a measurement equation that describes the dependence of the observables (denoted by yt) on the state. More specifically, the state is a vector of acoustic cues (features) characterizing the target sound stream and its direction. Stated mathematically, we have the following state-space equations:

xt = f(t, xt−1, dt),  (3.1)

yt = g(t, xt, ut, vt),  (3.2)

where dt and vt denote the dynamic and measurement noise processes, respectively, and the vector ut denotes the action taken by the observer. The process equation 3.1 embodies the state transition probability p(xt|xt−1), whereas the measurement equation 3.2 embodies the likelihood p(yt|xt). The goal of optimum filtering is then to estimate the posterior probability density p(xt|y0:t), given the initial prior p(x0) and the measurement history y0:t from time 0 to t. This classic problem is often referred to as "state estimation" in the literature. Depending on the specific scenario under study, such a hidden-state estimation problem can be tackled by using a Kalman filter (Kalman, 1960), an extended Kalman filter, or a particle filter (e.g., Cappé et al., 2005; Doucet et al., 2001).

In Nix et al. (2003), a particle filter is used as a statistical method for integrating temporal and frequency-specific features of a target speech signal. The elements of the state represent the azimuth and elevation of the different sound signals as well as the band-grouped short-time spectrum for each signal, whereas the observable measurements contain the binaural short-time spectra of the superposed voice signals. The state equation, representing the spectral dynamics of the speech signal, was learned off-line using vector quantization and a lookup table in a large codebook, where the codebook index for each pair of successive spectra was stored in a Markov transition matrix (MTM); the MTM provides statistical information about the transition probability p(xt|xt−1) between successive short-time speech spectra. The measurement equation, characterized by p(yt|xt), was approximated as a multidimensional Gaussian mixture probability distribution. By virtue of its very design, it is reported in Nix et al. (2003) that the tracker provides a one-step prediction of the underlying features of the target sound.

In the much more sophisticated neurobiological context, we may envision that the hierarchical auditory cortex (acting as a predictor) implements an online tracking task as a basis for dynamic feature binding and Bayesian estimation, in a fashion similar to that in the hierarchical visual cortex (Lee and Mumford, 2003; Rao and Ballard, 1999). Naturally, we may also incorporate "top-down" expectation as a feedback loop within the hierarchy to build a more powerful inference/prediction model. This is motivated by the generally accepted fact that the hierarchical architecture is omnipresent in the sensory cortices, starting with the primary sensory cortex and proceeding up to the highest areas that encode the most complex, abstract, and stable information.6

Figure 3.3 An information flowchart integrating "bottom-up" inference from acoustic and visual cues with "top-down" prediction in a hierarchical functional module of an intelligent machine (in a similar fashion as in the human auditory cortex).

A schematic diagram illustrating such a hierarchy is depicted in fig. 3.3, where the bottom-up (data-driven) and top-down (knowledge-driven) information flows are illustrated with arrows. In the figure, the feedforward pathway carries the inference, given the current and past observations; the feedback pathway conducts the prediction (expectation) to lower-level regions. To be specific, let z denote the top-down signal; then the conditional joint probability of the hidden state x and the bottom-up observation y, given z, may be written as

p(x, y|z) = p(y|x, z) p(x|z),  (3.3)

and the posterior probability of the hidden state can be expressed via Bayes's rule:

p(x|y, z) = p(y|x, z) p(x|z) / p(y|z),  (3.4)

where the denominator is a normalizing constant that is independent of the state x, the term p(x|z) in the numerator characterizes a top-down contextual prior, and the term p(y|x, z) describes the likelihood of the observation, given all available information. Hence, feedback information from a higher level can provide useful context to interpret or disambiguate the lower-level patterns. The same inference principle can be applied to the different levels of the hierarchy in fig. 3.3. To sum up, top-down predictive coding and bottom-up inference cooperate in learning the statistical regularities of the sensory environment; the top-down and bottom-up mechanisms also provide a possible basis for optimal action control within the framework of active audition, in a way similar to active vision (Bajcsy, 1988).
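As a concrete, minimal instance of the state estimation problem posed by equations 3.1 and 3.2, the following Python/NumPy sketch runs a scalar Kalman filter on the linear-Gaussian special case; all model parameters are illustrative:

```python
import numpy as np

# Scalar linear-Gaussian case of eqs. 3.1-3.2 (no action input for simplicity):
#   x_t = a * x_{t-1} + d_t,  d_t ~ N(0, q)     (process equation)
#   y_t = x_t + v_t,          v_t ~ N(0, r)     (measurement equation)
# The Kalman filter recursively computes the posterior mean m_t and
# variance p_t of p(x_t | y_0:t).

def kalman_filter(ys, a=0.95, q=0.1, r=1.0, m0=0.0, p0=1.0):
    m, p = m0, p0
    means = []
    for y in ys:
        m_pred, p_pred = a * m, a * a * p + q    # prediction step
        k = p_pred / (p_pred + r)                # Kalman gain
        m = m_pred + k * (y - m_pred)            # measurement update
        p = (1.0 - k) * p_pred
        means.append(m)
    return np.array(means), p

# Simulate a state trajectory and noisy measurements, then filter.
rng = np.random.default_rng(2)
x, xs, ys = 0.0, [], []
for _ in range(200):
    x = 0.95 * x + rng.normal(scale=np.sqrt(0.1))
    xs.append(x)
    ys.append(x + rng.normal(scale=1.0))
means, p_final = kalman_filter(ys)
err_filter = np.mean((means - np.array(xs)) ** 2)
err_raw = np.mean((np.array(ys) - np.array(xs)) ** 2)
print(err_filter, err_raw)
```

The filtered estimate tracks the hidden state with far lower error than the raw measurements; the extended Kalman filter and particle filter cited above generalize this recursion to the nonlinear, non-Gaussian models that realistic auditory tracking requires.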

3.5.4 Learning

Audition is a sophisticated, dynamic information-processing task performed in the human brain, which inevitably invokes other tasks almost simultaneously (such as action). Specifically, it is commonly believed that perception and action are mutually coupled and integrated via a sensorimotor interaction feedback loop, as illustrated in fig. 3.4A. Indeed, it is this unique feature that enables the human to survive in a dynamic environment. For the same reason, it is our belief that an intelligent machine that aims at solving the CPP must embody a learning capability, one that empowers the machine to take action whenever changes in the environment call for it. In the context of embodied intelligence, an autonomous agent is also supposed to conduct goal-oriented behavior during its interaction with the dynamic environment; hence, the necessity of taking action naturally arises. In other words, the agent has to continue to adapt itself (in terms of action or behavior) to maximize its (internal or external) reward, in order to achieve better perception of its environment (illustrated in fig. 3.4B). Such a problem naturally brings in the theory of reinforcement learning (Sutton and Barto, 1998).

Figure 3.4 (A) The sensorimotor feedback loop consisting of three distinct functions: perceive, think, and act. (B) The interaction between the agent and the environment: in state x, the agent takes action a and receives internal and external rewards.

For example, imagine a maneuverable machine aimed at solving a computational CPP in a noisy room environment. The system then has to learn how to adjust its distance and the angle of its microphone array with respect to the attended audio sources (such as speech, music, etc.). To do so, the machine should have a built-in rewarding mechanism for interacting with the dynamic environment, and it has to gradually adapt its behavior to achieve a higher (internal and external) reward.7 To sum up, an autonomous intelligent machine self-operating in a dynamic environment will always need to conduct optimal action control or decision making. Since this problem bears much resemblance to the Markov decision process (MDP) (Bellman and Dreyfus, 1962), we may resort to the well-established theory of dynamic programming and reinforcement learning.8

3.6 Concluding Remarks

In this chapter, we have discussed the machine cocktail party problem and explored possible ways to solve it by means of an intelligent machine. To do so, we have briefly reviewed historical accounts of the cocktail party problem, as well as the important aspects of human and computational auditory scene analysis. More important, we have proposed a computational framework for active audition as an inherent part of an embodied cognitive machine. In particular, we highlighted the essential functions of active audition and discussed their possible implementation. The four functions identified under the active audition paradigm provide the basis for building an embodied cognitive machine that is capable of human-like hearing in an "active" fashion. The central tenet of active audition embodied in such a machine is that an observer may be able to understand an auditory environment more effectively and efficiently if it interacts with the environment than if it remains a passive observer. In addition, in order to build a maneuverable intelligent machine (such as a robot), we have also discussed the issue of integrating different sensory (auditory and visual) features, so that active vision and active audition can be combined in a single system to achieve active perception in the true sense.

Appendix: Reinforcement Learning

Mathematically, a Markov decision process9 is formulated as follows:

Definition 3.1 A Markov decision process (MDP) is defined as a 6-tuple (S, A, R, p0, ps, pr), where

S is a (finite) set of (observable) environmental states (the state space), s ∈ S;

A is a (finite) set of actions (the action space), a ∈ A;

R is a (finite) set of possible rewards;

p0 is an initial probability distribution over S, written as p0(s0);

ps is a transition probability distribution over S conditioned on a value from S × A, also written as p^a_{ss′} or ps(st|st−1, at−1);10

pr is a probability distribution over R, conditioned on a value from S, written as pr(r|s).

Definition 3.2 A policy is a mapping from states to probabilities of selecting each possible action. A policy, denoted by π, can be deterministic, π : S → A, or stochastic, π : S → P(A). An optimal policy π* is a policy that maximizes (minimizes) the expected total reward (cost) over time (within a finite or infinite horizon).

Given the above definitions, the goal of reinforcement learning is to find an optimal policy π*(s) for each s, which maximizes the expected reward received over time. We assume that the policy is stochastic, and Q-learning (a special form of reinforcement learning) is aimed at learning a stochastic world in the sense that ps and pr are both nondeterministic. To evaluate reinforcement learning, a common measure of performance is the infinite-horizon discounted reward, which can be represented by the state-value function

V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r(st, π(st)) | s0 = s ],  (3.5)

where 0 ≤ γ < 1 is a discount factor, and E_π is the expectation operator over the policy π. The value function V^π(s) defines the expected discounted reward at state s, as shown by

V^π(s) = E_π{Rt | st = s},  (3.6)

where Rt = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯ = r_{t+1} + γ(r_{t+2} + γ r_{t+3} + ⋯) = r_{t+1} + γ R_{t+1}. Similarly, one may define a state-action value function, or so-called Q-function,

Q^π(s, a) = E_π{Rt | st = s, at = a},  (3.7)

for which the goal is not only to achieve a maximal reward but also to find an optimal action (supposing multiple actions are accessible for each state). It can be shown that

V^π(s) = E_π{r(s, π(s))} + γ Σ_{s′} p^π(s, s′) V^π(s′),

where p^π(s, s′) = Σ_{a} π(s, a) p^a_{ss′} denotes the state-transition probability under policy π, that

Q^π(s, a) = Σ_{s′∈S} p^a_{ss′} [R(s, a, s′) + γ V^π(s′)],

and that

V^π(s) = Σ_{a∈A} π(s, a) Q^π(s, a) = Σ_{a∈A} π(s, a) [ r_s^a + γ Σ_{s′} p^a_{ss′} V^π(s′) ],

which correspond to different forms of the Bellman equation (Bellman and Dreyfus, 1962). Note that if the state or action is continuous-valued, the summation operations are replaced by corresponding integration operations. The optimal value functions are then further defined as

V^{π*}(s) = max_π V^π(s),  Q^{π*}(s, a) = max_π Q^π(s, a).

Therefore, in light of dynamic programming theory (Bellman and Dreyfus, 1962), the optimal policy is deterministic and greedy with respect to the optimal value functions. Specifically, we may state that, given the state s and the optimal policy π*, the optimal action is selected according to the formula

a* = arg max_{a∈A} Q^{π*}(s, a),

such that V^{π*}(s) = max_{a∈A} Q^{π*}(s, a).
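For a small, fully known MDP, the optimal value function and greedy policy defined above can be computed directly by value iteration on the Bellman optimality equation. The following Python/NumPy sketch uses a hypothetical two-state, two-action MDP; the transition probabilities and rewards are illustrative:

```python
import numpy as np

# Value iteration: repeatedly apply
#   Q(s,a) = r_s^a + gamma * sum_{s'} p^a_{ss'} V(s'),   V(s) = max_a Q(s,a),
# until convergence; the greedy policy with respect to Q is then optimal.

gamma = 0.9
P = np.array([  # P[a, s, s'] : transition probabilities p^a_{ss'}
    [[0.9, 0.1], [0.8, 0.2]],   # action 0
    [[0.2, 0.8], [0.1, 0.9]],   # action 1
])
R = np.array([  # R[s, a] : expected immediate reward r_s^a
    [0.0, 1.0],
    [2.0, 0.0],
])

V = np.zeros(2)
for _ in range(500):
    Q = R + gamma * (P @ V).T   # Q[s, a] = r_s^a + gamma * sum_s' p^a_{ss'} V[s']
    V_new = Q.max(axis=1)       # V(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
policy = Q.argmax(axis=1)       # a* = argmax_a Q*(s, a)
print(V, policy)
```

In this toy MDP the optimal policy cycles between the two states to alternately collect the two rewards. The Q-learning algorithm discussed next reaches the same fixed point without ever knowing P and R explicitly.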


A powerful reinforcement learning tool for tackling the above-formulated problem is the Q-learning algorithm (Sutton and Barto, 1998; Watkins, 1989). Classic Q-learning is an asynchronous, incremental, approximate dynamic programming method for stochastic optimal control. Unlike traditional dynamic programming, Q-learning is model-free in the sense that its operation requires neither the state transition probabilities nor the environmental dynamics. In addition, Q-learning is computationally efficient and can be operated in an online manner (Sutton and Barto, 1998). For finite state and action sets, if each (s, a) pair is visited infinitely often and the step-size sequence used in Q-learning is appropriately decreasing, then Q-learning is assured to converge to the optimal policy with probability 1 (Tsitsiklis, 1994; Watkins and Dayan, 1992). For problems with a continuous state, function approximation methods can be used to tackle the generalization issue; see Bertsekas and Tsitsiklis (1996) and Sutton and Barto (1998) for detailed discussions.

Another model-free reinforcement learning algorithm is the actor-critic model (Sutton and Barto, 1998), which describes a bootstrapping strategy for reinforcement learning. Specifically, the actor-critic model has separate memory structures to represent the policy and the value function: the policy structure resides within the actor, which selects the optimal actions; the value function is estimated by the critic, which criticizes the actions made by the actor. Learning is always on-policy, in that the critic uses a form of temporal-difference (TD) error to maximize the reward, whereas the actor uses the estimated value function from the critic to bootstrap itself toward a better policy.

A much more challenging but more realistic reinforcement learning problem is the so-called partially observable Markov decision process (POMDP).
Unlike the MDP, which assumes full knowledge of the observable states, the POMDP addresses stochastic decision-making and optimal control problems in which the states of the environment are only partially observable. In this case the elegant Bellman equation no longer holds, since it requires a completely observable Markovian environment (Kaelbling, 1993). The literature on POMDPs is extensive and ever growing; it is beyond the scope of the current chapter to expound on this problem, and we refer the interested reader to Kaelbling et al. (1998), Lovejoy (1991), and Smallwood and Sondik (1973) for more details.

Acknowledgments

This chapter grew out of a review article (Haykin and Chen, 2005). In particular, we would like to thank our research colleagues R. Dong and S. Doclo for valuable feedback. The work reported here was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Notes

1. For tutorial treatments of the cocktail party problem, see Chen (2003), Haykin and Chen (2005), and Divenyi (2004).
2. Categorization of these three neural processes, done essentially for research-related studies, is somewhat artificial; the boundary between them is fuzzy in that the brain does not necessarily distinguish between them as defined herein.
3. In Cherry's original paper, "intelligibility" refers to the probability of correctly identifying meaningful speech sounds; in contrast, "articulation" refers to a measure of nonsense speech sounds (Cherry, 1953).
4. As pointed out by Bajcsy (1988), the proposition that active sensing is an application of intelligent control theory may be traced to the PhD thesis of Tenenbaum (1970).
5. At McMaster University we are currently exploring the DSP hardware implementation of the fusion scheme that is depicted in fig. 3.2.
6. The "top-down" influence is particularly useful for (1) synthesizing missing information (e.g., the auditory "fill-in" phenomenon); (2) incorporating contextual priors and inputs from other sensory modalities; and (3) resolving perceptual ambiguities whenever lower-level information leads to confusion.
7. The reward can be a measure of speech intelligibility, signal-to-interference ratio, or some sort of utility function.
8. Reinforcement learning is well known in the machine learning community, but, regrettably, not so in the signal processing community. An appendix on reinforcement learning is included at the end of the chapter largely for the benefit of readers who may not be familiar with this learning paradigm.
9. For simplicity of exposition, we restrict our discussion to finite discrete state and action spaces, but the treatment also applies to the more general continuous state or action space.
10. In the case of a finite discrete state space, the transition probability p constitutes a transition matrix.

4

Sensor Adaptive Signal Processing of Biological Nanotubes (Ion Channels) at Macroscopic and Nano Scales

Vikram Krishnamurthy

Ion channels are biological nanotubes formed by large protein molecules in the cell membrane. All electrical activities in the nervous system, including communications between cells and the inﬂuence of hormones and drugs on cell function, are regulated by ion channels. Therefore understanding their mechanisms at a molecular level is a fundamental problem in biology. This chapter shows how dynamic stochastic models and associated statistical signal-processing techniques together with novel learning-based stochastic control methods can be used to understand the structure and dynamics of ion channels at both macroscopic and nanospatial scales. The unifying theme of this chapter is the concept of sensor adaptive signal processing, which deals with sensors dynamically adjusting their behavior so as to optimize their ability to extract signals from noise.

4.1 Introduction

All living cells are surrounded by a cell membrane, composed of two layers of phospholipid molecules, called the lipid bilayer. Ion channels are biological nanotubes formed by protein macromolecules that facilitate the diffusion of ions across the cell membrane. Although we use the term biological nanotube, ion channels are typically on the scale of angstroms ($10^{-10}$ m), i.e., an order of magnitude smaller in radius and length than the carbon nanotubes used in nanodevices. In the past few years, there have been enormous strides in our understanding of the structure-function relationships in biological ion channels. These advances have been brought about by the combined efforts of experimental and computational biophysicists, who together are beginning to unravel the working principles of these exquisitely designed biological nanotubes that regulate the flow of charged particles across the cell membrane.

The measurement of ionic currents flowing through single ion channels in cell membranes has been made possible by the gigaseal patch-clamp technique (Hamill et al., 1981; Neher and Sakmann, 1976). This was a major breakthrough, for which Neher and Sakmann won the 1991 Nobel Prize in Medicine (Neher and Sakmann, 1976). More recently, the 2003 Nobel Prize in Chemistry was awarded to MacKinnon for determining the structure of several different types of ion channels (including the bacterial potassium channel; Doyle et al. (1998)) from crystallographic analyses. Because all electrical activities in the nervous system, including communications between cells and the influence of hormones and drugs on cell function, are regulated by membrane ion channels, understanding their mechanisms at a molecular level is a fundamental problem in biology. Moreover, elucidating how single ion channels work will ultimately help neurobiologists find the causes of, and possibly cures for, a number of neurological and muscular disorders. We refer the reader to the special issue of IEEE Transactions on NanoBioScience (Krishnamurthy et al., 2005) for an excellent up-to-date account of ion channels written by leading experts in the area.

This chapter addresses two fundamental problems in ion channels from a statistical signal processing and stochastic control (optimization) perspective: the gating problem and the ion permeation problem. The gating problem (Krishnamurthy and Chung, 2003) deals with understanding how ion channels undergo structural changes to regulate the flow of ions into and out of a cell. Typically a gated ion channel has two states: a "closed" state, which does not allow ions to flow through, and an "open" state, which does. In the open state, the ion channel currents are typically of the order of picoamperes ($10^{-12}$ A).
The measured ion channel currents (obtained by sampling, typically at 10 kHz, i.e., on a 0.1 millisecond time scale) are obfuscated by large amounts of thermal noise. In sections 4.2 and 4.3 of this chapter, we address the following issues related to the gating problem: (1) We present a hidden Markov model (HMM) formulation of the observed ion channel current. (2) We present in section 4.2 a discrete stochastic optimization algorithm for controlling a patch-clamp experiment to determine the Nernst potential of the ion channel with minimal effort; this fits in the class of so-called experimental design problems. (3) In section 4.3, we briefly discuss dynamic scheduling algorithms for activating multiple ion channels on a biological chip so as to extract maximal information from them.

The permeation problem (Allen et al., 2003; O'Mara et al., 2003) seeks to explain the working of an ion channel at an angstrom ($10^{-10}$ m) spatial scale by studying the propagation of individual ions through the ion channel at a femtosecond ($10^{-15}$ s) time scale. This setup is said to be at a mesoscopic scale, since the individual ions (e.g., Na+ ions) are of the order of a few angstroms in radius and are comparable in radius to the ion channel. At this mesoscopic level, point-charge approximations and continuum electrostatics break down. The discrete, finite nature of each ion needs to be taken into consideration. Also, the failure of the mean field approximation in narrow channels implies that any theory that aspires to relate channel structure to its function must treat ions explicitly.

In sections 4.4, 4.5, and 4.6 of this chapter, we discuss the permeation problem for ion channels. We show how Brownian dynamics simulation can be used to model the propagation of individual ions. We also show how stochastic-gradient learning-based schemes can be used to control the evolution of a Brownian dynamics simulation to predict the molecular structure of an ion channel. We refer the reader to our recent research (Krishnamurthy and Chung, a,b), where a detailed exposition of the resulting adaptive Brownian dynamics simulation algorithm is given. Furthermore, numerical results presented in Krishnamurthy and Chung (a,b) for antibiotic Gramicidin-A ion channels show that the estimates obtained from the adaptive Brownian dynamics algorithm are consistent with the known molecular structure of Gramicidin-A.

An important underlying theme of this chapter is the ubiquitous nature of sensor adaptive signal processing. This transcends standard statistical signal processing, which deals with extracting signals from noisy observations, to examine the deeper problem of how to dynamically adapt the sensor so as to optimize the performance of the signal-processing algorithm. That is, the sensors dynamically modify their behavior to optimize their performance in extracting the underlying signal from noisy observations. A crucial aspect of sensor adaptive signal processing is feedback: past decisions about adapting the sensor affect future observations. Such sensor adaptive signal processing has recently been used in defense networks (Evans et al., 2001; Krishnamurthy, 2002, 2005) for scheduling sophisticated multimode sensors in unattended ground sensor networks, radar emission control, and adaptive radar beam allocation.
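The Brownian dynamics simulations mentioned above propagate each ion by a discretized overdamped Langevin equation. The one-dimensional sketch below conveys the idea with a toy harmonic potential and arbitrary units; none of the constants or the potential represent actual channel physics:

```python
import math
import random

# One-dimensional overdamped Langevin (Brownian dynamics) sketch.
# Units, constants, and the potential are toy placeholders, not channel physics.
kT = 1.0     # thermal energy
D = 1.0      # diffusion coefficient
dt = 1e-3    # integration time step

def force(x):
    """Force from a toy harmonic well U(x) = x**2 / 2."""
    return -x

def simulate(x0=2.0, steps=5000, seed=1):
    """Propagate one 'ion' and return its final position."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        drift = (D / kT) * force(x) * dt                      # drift down the gradient
        kick = math.sqrt(2.0 * D * dt) * rng.gauss(0.0, 1.0)  # thermal noise
        x += drift + kick
    return x

x_final = simulate()
```

Averaged over many such trajectories, the positions settle into the Boltzmann distribution of the well, which is the statistical behavior that the full three-dimensional simulations of sections 4.4 to 4.6 exploit.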
In this chapter we show how the powerful paradigm of sensor adaptive signal processing can be successfully applied to biological ion channels both at the macroscopic and nano scales.

4.2 The Gating Problem and Estimating the Nernst Potential of Ion Channels

In this section we first outline the well-known hidden Markov model (HMM) for the ion channel current in the gating problem. We refer the reader to Krishnamurthy and Chung (2003) for a detailed exposition. Estimating the underlying ion channel current from the noisy HMM observations is a well-studied problem in HMM signal processing (Ephraim and Merhav, 2002; James et al., 1996; Krishnamurthy and Yin, 2002). In this section, consistent with the theme of sensor adaptive signal processing, we address the deeper issue of how to dynamically control the behavior of the ion channels to extract maximal information about their behavior. In particular we propose two novel applications of stochastic control for adapting the behavior of the ion channel. Such ideas are also relevant in other applications such as sensor scheduling in defense networks (Krishnamurthy, 2002, 2005).


4.2.1 Hidden Markov Model Formulation of Ion Channel Current

The patch clamp is a device for isolating the current from a single ion channel. A typical trace of the ion channel current measured in a patch-clamp experiment (after suitable anti-aliasing filtering and sampling) shows that the channel current is a piecewise-constant discrete-time signal that randomly jumps between two values: zero amperes, which denotes the closed state of the channel, and I(θ) amperes (typically a few picoamperes), which denotes the open state. I(θ) is called the open-state current level. Sometimes the current recorded from a single ion channel dwells on one or more intermediate levels, known as conductance substates.

Chung et al. (1990, 1991) first introduced the powerful paradigm of hidden Markov models (HMMs) to characterize patch-clamp recordings of small ion channel currents contaminated by random and deterministic noise. By using sophisticated HMM signal-processing methods, Chung et al. (1990, 1991) demonstrated that the underlying parameters of the HMM could be obtained to remarkable precision despite the extremely poor signal-to-noise ratio. These HMM parameter estimates yield important information about the dynamics of ion channels. Since the publications of Chung et al. (1990, 1991), several papers have appeared in the neurobiological community that generalize these HMM signal models in various ways to model measurements of ion channels (see Venkataramanan et al. (2000) and the references therein). With these HMM techniques, it is now possible for neurobiologists to analyze not only large ion channel currents but also small conductance fluctuations occurring in noise.

Markov Model for Ion Channel Current

Suppose a patch-clamp experiment is conducted with a voltage θ applied across the ion channel. Then, as described in Chung et al. (1991) and in Venkataramanan et al. (2000), the ion channel current {i_n(θ)} can be modeled as a three-state homogeneous first-order Markov chain.
The state space of this Markov chain is {0_g, 0_b, I(θ)}, corresponding to the physical states of gap mode, burst-mode-closed, and burst-mode-open. For convenience, we will refer to the burst-mode-closed and burst-mode-open states as the closed and open states, respectively. In the gap mode and in the closed state, the ion channel current is zero. In the open state, the ion channel current has a value of I(θ). The 3 × 3 transition probability matrix A(θ) of the Markov chain {i_n(θ)}, which governs the probabilistic behavior of the channel current, is given by

$$A(\theta) = \begin{pmatrix} a_{11}(\theta) & a_{12}(\theta) & 0 \\ a_{21}(\theta) & a_{22}(\theta) & a_{23}(\theta) \\ 0 & a_{32}(\theta) & a_{33}(\theta) \end{pmatrix}, \qquad (4.1)$$

where the rows and columns are indexed by the states 0_g, 0_b, I(θ), in that order. The elements of A(θ) are the transition probabilities $a_{ij}(\theta) = P(i_{n+1}(\theta) = j \mid i_n(\theta) = i)$, where i, j ∈ {0_g, 0_b, I(θ)}. The zero probabilities in the above

matrix A(θ) reflect the fact that an ion channel current cannot jump directly from the gap mode to the open state; similarly, an ion channel current cannot jump from the open state to the gap mode. Note that, in general, the applied voltage θ affects both the transition probabilities and the state levels of the ion channel current {i_n(θ)}.

Hidden Markov Model (HMM) Observations

Let {y_n(θ)} denote the measured noisy ion channel current at the electrode when conducting a patch-clamp experiment:

$$y_n(\theta) = i_n(\theta) + w_n(\theta), \qquad n = 1, 2, \ldots \qquad (4.2)$$

Here {w_n(θ)} is thermal noise and is modeled as zero-mean white Gaussian noise with variance σ²(θ). Thus the observation process {y_n(θ)} is a hidden Markov model (HMM) sequence parameterized by the model

$$\lambda(\theta) = \{A(\theta), I(\theta), \sigma^2(\theta)\}, \qquad (4.3)$$

where θ denotes the applied voltage. We remark here that the formulation trivially extends to observation models in which the noise process w_n(θ) includes a time-varying deterministic component together with white noise; only the HMM parameter estimation algorithm needs to be modified, as in Krishnamurthy et al. (1993).

HMM Parameter Estimation of Current Level I(θ)

Given the HMM model for the ion channel current above, estimating I(θ) for a fixed voltage θ involves processing the noisy observations {y_n(θ)} through an HMM maximum likelihood parameter estimator. The most popular way of computing the maximum likelihood estimate (MLE) of I(θ) is via the expectation maximization (EM) algorithm (the Baum-Welch equations). The EM algorithm is an iterative algorithm for computing the MLE. It is now fairly standard in the signal-processing and neurobiology literature; see Ephraim and Merhav (2002) for a recent exposition, or Chung et al. (1991), which is aimed at neurobiologists.

Let Î_Δ(θ) denote the MLE of I(θ) based on the Δ-point measured channel current sequence (y_1(θ), …, y_Δ(θ)). For a sufficiently large batch size Δ of observations, by the asymptotic normality of the MLE for an HMM (Bickel et al., 1998),

$$\sqrt{\Delta}\,\bigl(\hat{I}_\Delta(\theta) - I(\theta)\bigr) \sim N(0, \Sigma(\theta)), \qquad (4.4)$$

where Σ⁻¹(θ) is the Fisher information matrix. Thus asymptotically Î_Δ(θ) is an unbiased estimator of I(θ), i.e., E{Î_Δ(θ)} = I(θ), where E{·} denotes the mathematical expectation operator.
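The signal model of equations 4.1 and 4.2 is straightforward to simulate. The sketch below draws the three-state chain and adds Gaussian noise; the transition probabilities, open-state current, and noise level are invented for illustration and chosen only to respect the zero pattern of A(θ):

```python
import random

# Hypothetical parameter values, chosen only to respect the zero pattern
# of A(theta): no direct gap <-> open transitions.
I_OPEN = 1.0   # open-state current level I(theta) (arbitrary units)
SIGMA = 0.5    # standard deviation of the thermal noise w_n(theta)
A = [
    [0.95, 0.05, 0.00],  # gap mode -> {gap, closed}
    [0.02, 0.90, 0.08],  # closed   -> any state
    [0.00, 0.10, 0.90],  # open     -> {closed, open}
]

def simulate_channel(n_samples=5000, seed=7):
    """Return (i_n, y_n): true piecewise-constant current and noisy trace."""
    rng = random.Random(seed)
    s = 1  # start in burst-mode-closed
    i_true, y = [], []
    for _ in range(n_samples):
        s = rng.choices((0, 1, 2), weights=A[s])[0]  # Markov transition
        current = I_OPEN if s == 2 else 0.0          # states 0_g, 0_b give 0
        i_true.append(current)
        y.append(current + rng.gauss(0.0, SIGMA))    # y_n = i_n + w_n
    return i_true, y

i_true, y = simulate_channel()
```

Running an EM (Baum-Welch) estimator on y recovers estimates of A(θ), I(θ), and σ²(θ); generating synthetic traces like this is a standard way to sanity-check such an estimator before applying it to real patch-clamp data.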

4.2.2 Nernst Potential and Discrete Stochastic Optimization for Ion Channels

To record currents from single ion channels, the tip of an electrode, with a diameter of about 1 μm, is pushed against the surface of a cell, and then a tight seal is formed between the rim of the electrode tip and the cell membrane. A patch of the membrane surrounded by the electrode tip usually contains one or more single ion channels. The current flowing from the inside of the cell to the tip of the electrode through a single ion channel is monitored. This is known as the cell-attached configuration of the patch-clamp technique for measuring currents through a single ion channel. Figure 4.1 shows the schematic setup of the cell in electrolyte and the electrode pushed against the surface of the cell.

Figure 4.1 Cell-attached patch experimental setup. (Labels in the original figure: electrode with concentration c_o and potential E_o, pushed against the cell with concentration c_i and potential E_i.)

In a living cell, there is a potential difference between its interior and the outside environment, known as the membrane potential. Typically, the cell interior is about 60 mV more negative than the outside. Also, the ionic concentrations (mainly Na+, Cl−, and K+) inside a cell are very different from those outside the cell. In the cell-attached configuration, the ionic strength in the electrode is usually made the same as that outside the cell. Let E_i and E_o, respectively, denote the resting membrane potential and the potential applied to the electrode. If E_o is identical to the membrane potential, there will be no potential gradient across the membrane patch confined by the tip of the electrode. Let c_i denote the intracellular ionic concentration and c_o the ionic concentration in the electrode. The intracellular concentration c_i is unknown, as is the resting membrane potential E_i; c_o and E_o are set by the experimenter and are known.

Let θ = E_o − E_i denote the potential gradient. Both the potential gradient θ and the concentration gradient c_o − c_i drive ions across an ion channel, resulting in an ion channel current {i_n(θ)}. This ion channel current is a piecewise-constant signal that jumps between the values of zero and I(θ), where I(θ) denotes the current when the ion channel is in the open state. The potential E_o (and hence the potential difference θ) is adjusted experimentally until the current I(θ) goes to zero. The voltage θ* at which the current I(θ*) vanishes is called the Nernst potential and satisfies the so-called Nernst equation

$$\theta^* = -\frac{kT}{e} \ln \frac{c_o}{c_i} = -59 \log_{10} \frac{c_o}{c_i} \ \text{(mV)}, \qquad (4.5)$$


where e = 1.6 × 10⁻¹⁹ C denotes the charge of an electron, k denotes Boltzmann's constant, and T denotes the absolute temperature. The Nernst equation (4.5) gives the potential difference θ required to maintain electrochemical equilibrium when the concentrations on the two faces of the membrane differ.

Estimating the Nernst potential θ* requires conducting experiments at different values of the voltage θ. In patch-clamp experiments, the applied voltage θ is usually chosen from a finite set. Let θ ∈ Θ = {θ(1), …, θ(M)} denote the finite set of possible voltage values that the experimenter can pick. For example, in typical experiments, if one needs to determine the Nernst potential to a resolution of 4 mV, then M = 80 and the θ(i) are uniformly spaced in 4 mV steps from θ(1) = −160 mV to θ(M) = 160 mV. Note that the Nernst potential θ* (the zero-crossing point) does not necessarily belong to the discrete set Θ; instead we will find the point in Θ that is closest to θ* (with resolution θ(2) − θ(1)). With a slight abuse of notation we will denote the element of Θ closest to the Nernst potential as θ*. Thus determining θ* ∈ Θ can be formulated as a discrete optimization problem:

$$\theta^* = \arg\min_{\theta \in \Theta} |I(\theta)|^2.$$
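As a quick numerical illustration of equation 4.5 and the voltage grid Θ just described (the ionic concentrations below are hypothetical, loosely resembling a potassium gradient):

```python
import math

def nernst_mV(c_out, c_in, T=298.0):
    """Nernst potential in mV for a monovalent cation, per equation 4.5."""
    k = 1.380649e-23   # Boltzmann constant, J/K
    e = 1.602e-19      # elementary charge, C
    kT_over_e_mV = 1000.0 * k * T / e   # about 25.7 mV at room temperature
    return -kT_over_e_mV * math.log(c_out / c_in)

# Hypothetical K+-like gradient: 4 mM in the electrode, 140 mM inside.
theta_star = nernst_mV(c_out=4.0, c_in=140.0)   # roughly +91 mV

# Snap theta* to a 4 mV grid spanning -160 mV to 160 mV, as in the text.
grid = [-160 + 4 * i for i in range(81)]
theta_closest = min(grid, key=lambda v: abs(v - theta_star))
```

Note that −(kT/e) ln(c_o/c_i) and −59 log₁₀(c_o/c_i) mV agree to within rounding at room temperature, which is where the factor of 59 in equation 4.5 comes from.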

Discrete Stochastic Approximation Algorithm

Learning the Nernst potential can be formulated as the following discrete stochastic optimization problem:

$$\theta^* = \arg\min_{\theta \in \Theta} E\bigl\{|\hat{I}(\theta)|^2\bigr\}, \qquad (4.6)$$

where Î(θ) is the MLE of the parameter I(θ) of the HMM. Since for an HMM no closed-form expression is available for Σ⁻¹(θ) in equation 4.4, the above expectation cannot be evaluated analytically. This motivates the need to develop a simulation-based (stochastic approximation) algorithm. We refer the reader to Krishnamurthy and Chung (2003) for details.

The idea of discrete stochastic approximation (Andradottir, 1999) is to design a plan of experiments which provides more observations in areas where the Nernst potential is expected and fewer in other areas. More precisely, what is needed is a dynamic resource allocation (control) algorithm that dynamically controls (schedules) the choice of voltage at which the HMM estimator operates, in order to efficiently obtain the zero point and deduce how the current increases or decreases as the applied voltage deviates from the Nernst potential. We propose a discrete stochastic approximation algorithm that is both consistent and attracted to the Nernst potential. That is, the algorithm should spend more time gathering observations {y_n(θ)} at the Nernst potential θ = θ* and less time at other values of θ ∈ Θ. Thus in discrete stochastic approximation the aim is to devise an efficient (Pflug, 1996, chapter 5.3) adaptive search (sampling plan) which allows finding the minimizer θ* with as few samples as possible by not making unnecessary observations at nonpromising values of θ. Here we construct algorithms based on the random search procedures of Andradottir (1995, 1999). The basic idea is to generate a homogeneous Markov chain taking values in Θ which spends more time at the global optimum than at any other element of Θ. We will show that these algorithms can be modified for tracking time-varying Nernst potentials. Finally, it is worth mentioning that there are other classes of simulation-based discrete stochastic optimization algorithms, such as nested partition methods (Swisher et al., 2000), which combine partitioning, random sampling, and backtracking to create a Markov chain that converges to the global optimum.

Let n = 1, 2, … denote discrete time. The proposed algorithm is recursive and requires conducting experiments on batches of data, so it is convenient to introduce the following notation. Group the discrete time into batches of length Δ (typically Δ = 10,000 in experiments). We use the index N = 1, 2, … to denote the batch number; thus batch N comprises the Δ discrete time instants n ∈ {NΔ, NΔ + 1, …, (N + 1)Δ − 1}. Let D_N = (D_N(1), …, D_N(M)) denote the vector of duration times the algorithm spends at the M possible potential values in Θ. Finally, for notational convenience define the M-dimensional unit vectors e_m, m = 1, …, M, as

$$e_m = (0 \ \cdots \ 0 \ 1 \ 0 \ \cdots \ 0), \qquad (4.7)$$

with 1 in the mth position and zeros elsewhere.

The discrete stochastic approximation algorithm of Andradottir (1995) is not directly applicable to the cost function 4.6, since it applies to optimization problems of the form min_{θ∈Θ} E{C(θ)}. However, equation 4.6 can easily be converted to this form as follows. Let Î₁(θ), Î₂(θ) be two statistically independent unbiased HMM estimates of I(θ). Then, defining Ĉ(θ) = Î₁(θ) Î₂(θ), it straightforwardly follows that

$$E\{\hat{C}(\theta)\} = \bigl(E\{\hat{I}(\theta)\}\bigr)^2 = |I(\theta)|^2. \qquad (4.8)$$

The discrete stochastic approximation algorithm we propose is as follows:

Algorithm 4.1 Algorithm for Learning the Nernst Potential

Step 0 (Initialization): At batch-time N = 0, select a starting point X₀ ∈ {1, …, M} randomly. Set D₀ = e_{X₀}, and set the initial solution estimate θ₀* = θ(X₀).

Step 1 (Sampling): At batch-time N, sample X̃_N ∈ {X_N − 1, X_N + 1} with uniform distribution.

Step 2 (Evaluation and acceptance): Apply the voltage θ̃ = θ(X̃_N) to the patch-clamp experiment and obtain two Δ-length batches of HMM observations. Let Î_N⁽¹⁾(θ̃) and Î_N⁽²⁾(θ̃) denote the HMM-MLE estimates for these two batches, computed using the EM algorithm (James et al., 1996; Krishnamurthy and Chung, 2003). Set Ĉ_N(θ̃) = Î_N⁽¹⁾(θ̃) Î_N⁽²⁾(θ̃).


Then apply the voltage θ = θ(X_N), obtain two further Δ-length batches, and compute the HMM-MLE estimates Î_N⁽¹⁾(θ) and Î_N⁽²⁾(θ). Set Ĉ_N(θ) = Î_N⁽¹⁾(θ) Î_N⁽²⁾(θ). If Ĉ_N(θ̃) < Ĉ_N(θ), set X_{N+1} = X̃_N; else, set X_{N+1} = X_N.

Step 3 (Update occupation probabilities of X_N): D_{N+1} = D_N + e_{X_{N+1}}.

Step 4 (Update estimate of Nernst potential): θ_N* = θ(m*), where

$$m^* = \arg\max_{m \in \{1, \ldots, M\}} D_{N+1}(m).$$

Set N → N + 1 and go to step 1.

The proof of convergence of the algorithm is given in theorem 4.1 below. The main idea behind the above algorithm is that the sequence {X_N} (or equivalently {θ(X_N)}) generated by steps 1 and 2 is a homogeneous Markov chain with state space {1, …, M} (respectively, Θ) that is designed to spend more time at the global optimizer θ* than at any other state. In the above algorithm, θ̂_N* denotes the estimate of the Nernst potential at batch N.

Interpretation of Step 3 as a Decreasing-Step-Size Adaptive Filtering Algorithm

Define the occupation probability estimate vector as π̂_N = D_N / N. Then the update in step 3 can be reexpressed as

$$\hat{\pi}_{N+1} = \hat{\pi}_N + \mu_{N+1}\bigl(e_{X_{N+1}} - \hat{\pi}_N\bigr), \qquad \hat{\pi}_0 = e_{X_0}. \qquad (4.9)$$

This is merely an adaptive filtering algorithm for updating π̂_N with decreasing step size μ_N = 1/N. Hence algorithm 4.1 can be viewed as a decreasing-step-size algorithm which involves a least mean squares (LMS) algorithm (with decreasing step size) in tandem with a random search and evaluation step (steps 1 and 2) for generating X_N. Figure 4.2 shows a schematic diagram of the algorithm with this LMS interpretation of step 3.

In Andradottir (1995), the following stochastic ordering assumption was used to prove convergence of algorithm 4.1.

(O) For any m ∈ {1, …, M − 1},

$$I^2(\theta(m+1)) > I^2(\theta(m)) \implies P\bigl(\hat{C}(\theta(m+1)) > \hat{C}(\theta(m))\bigr) > 0.5,$$
$$I^2(\theta(m+1)) < I^2(\theta(m)) \implies P\bigl(\hat{C}(\theta(m+1)) > \hat{C}(\theta(m))\bigr) < 0.5.$$
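To make algorithm 4.1 concrete, here is a minimal sketch on a synthetic problem. The HMM-MLE batch estimates are replaced by direct noisy measurements of a made-up linear current profile I(θ) (a stand-in assumption, since running an EM estimator is beside the point here), and the clamping of the random walk at the edges of the grid is likewise an assumption of this sketch:

```python
import random

# Synthetic stand-in for the experiment: a made-up linear current profile
# I(theta) with its zero crossing (the "Nernst potential") at +20 mV.
GRID = list(range(-40, 44, 4))   # candidate voltages theta(1..M)
SIGMA = 0.1                      # noise std of each current estimate

def C_hat(theta, rng):
    """Product of two independent noisy estimates of I(theta), so that
    E[C_hat(theta)] = I(theta)**2, as in equation 4.8."""
    I = 0.05 * (theta - 20.0)
    return (I + rng.gauss(0.0, SIGMA)) * (I + rng.gauss(0.0, SIGMA))

def learn_nernst(batches=2000, seed=3):
    rng = random.Random(seed)
    M = len(GRID)
    x = rng.randrange(M)             # step 0: random starting point
    D = [0] * M                      # occupation times D_N
    D[x] = 1
    for _ in range(batches):
        # step 1: propose a neighbour (edges clamped; an assumption here)
        x_new = max(0, min(M - 1, x + rng.choice((-1, 1))))
        # step 2: accept the candidate if its sampled cost is lower
        if C_hat(GRID[x_new], rng) < C_hat(GRID[x], rng):
            x = x_new
        D[x] += 1                    # step 3: update occupation times
    # step 4: the estimate is the most-visited voltage
    return GRID[max(range(M), key=lambda m: D[m])]

theta_hat = learn_nernst()
```

The chain wanders over the grid but is accepted toward the zero crossing far more often than away from it, so the occupation-time vector D concentrates at the minimizer; this is exactly the attraction property that theorem 4.1 below formalizes.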

Figure 4.2 Schematic of algorithm 4.1. (Blocks in the original figure: step 1 samples X̃_N from {X_N − 1, X_N + 1}; step 2 runs the patch-clamp experiment at voltages θ(X_N) and θ(X̃_N) and evaluates Ĉ_N(θ̃) and Ĉ_N(θ) via the HMM MLE estimator; step 3 is the adaptive filter with step size μ_N, whose occupation probability estimate π̂_N is maximized to give θ̂_N*.)


Theorem 4.1 Under condition (O) above, the sequence {θ(X_N)} generated by algorithm 4.1 is a homogeneous, aperiodic, irreducible Markov chain with state space Θ. Furthermore, algorithm 4.1 is attracted to the Nernst potential θ*; i.e., for sufficiently large N, the sequence {θ(X_N)} spends more time at θ* than at any other state. (Equivalently, if θ(m*) = θ*, then D_N(m*) > D_N(j) for j ∈ {1, …, M} − {m*}.)

The above discrete stochastic approximation algorithm can be viewed as a discrete analog of the well-known LMS algorithm. Recall that in the LMS algorithm the new estimate is computed from the previous estimate by moving along a desirable search direction (based on gradient information). In complete analogy, in the discrete search algorithm above the new estimate is obtained by moving along a discrete search direction to a desirable new point. We refer the reader to Krishnamurthy and Chung (2003) and our recent papers (Krishnamurthy et al., 2004; Yin et al., 2004) for complete convergence details of the above discrete stochastic approximation algorithm.

4.3 Scheduling Multiple Ion Channels on a Biological Chip

In this section, we consider dynamic scheduling and control of the gating process of ion channels on a biological chip. Patch clamping has rapidly become the "gold standard" (Fertig et al., 2002) for neurobiologists studying the dynamics of ion channel function. However, patch clamping is a laborious process requiring precision micromanipulation under high-power visual magnification, vibration damping, and an experienced, skillful experimenter. Because of this, high-throughput studies required in proteomics and drug development have had to rely on less valuable methods such as fluorescence-based measurement of intracellular ion concentrations (Xu et al., 2001). There is thus significant interest in an automated version of the whole patch-clamp principle, preferably one that has the potential to be used in parallel on a number of cells.

In 2002, Fertig et al. (2002) made a remarkable invention: the first successful demonstration of a patch clamp on a chip, a planar quartz-based biological chip that consists of several hundred ion channels (Sigworth and Klemic, 2002). This patch-clamp chip can be used for massively parallel screens of ion channel activity, thereby providing a high-throughput screening tool for drug discovery efforts. Typically, because of their high cost, most neurobiological laboratories have only one patch-clamp amplifier that can be connected to the patch-clamp chip; as a result, only one ion channel in the patch-clamp chip can be monitored at a given time. It is thus of significant interest to devise an adaptive scheduling strategy that dynamically decides which single ion channel to activate at each time instant in order to maximize the throughput (information) from the patch-clamp experiment. Such a scheduling strategy will enable rapid evaluation and screening of drugs. Note that this problem directly fits into our main theme of sensor adaptive signal processing. Here we consider the problem of how to dynamically schedule the activation of individual ion channels, using a laser beam, to maximize the information obtained from the patch-clamp chip for high-throughput drug evaluation. We refer the reader to Krishnamurthy (2004) for a detailed exposition of the problem together with numerical studies.

The ion channel activation scheduling algorithm needs to dynamically plan and react to the uncertain (random) dynamics of the individual ion channels in the chip. Moreover, excessive use of a single ion channel can make it desensitized. The aim is to answer the following question: How should the ion channel activation scheduler dynamically decide which ion channel on the patch-clamp chip to activate at each time instant, so as to minimize the overall desensitization of channels while simultaneously extracting maximum information from them?

We refer the reader to Fertig et al. (2002) for details on the synthesis of a patch-clamp chip. The chip consists of a quartz substrate of 200 micrometers thickness that is perforated by wet etching techniques, resulting in apertures with diameters of approximately 1 micrometer. The apertures replace the tips of the glass pipettes commonly used for patch-clamp recording. Cells are positioned onto the apertures from suspension by application of suction.

Figure 4.3 One-dimensional section of planar biological chip. (Elements labeled in the original figure: laser, cell membrane, ion channel, caged glutamate, electrolyte solution, stochastic scheduler, and amplifier.)

A schematic illustration of the ion channel scheduling problem for the patch-clamp chip is given in fig. 4.3. The figure shows a cross section of the chip with 4 ion channels. The planar chip could, for example, consist of 50 rows each containing 4 ion channels. Each of the four wells contains a membrane patch with an ion channel. The external electrolyte solutions contain caged ligands (such as caged glutamate). When a laser beam is directed at a well, the inert caged ligands become


Sensor Adaptive Signal Processing of Biological Nanotubes

active ligands that cause a channel to go from the closed conformation to an open conformation. Ions then flow across the open channel, and the current generated by the motion of charged particles is monitored with a patch-clamp amplifier. The amplifier is switched electronically from the output of one well to another. Typically, the magnitude of the current across each channel, when it is open, is about 1 pA (10^{-12} A). The design of the ion channel activation scheduling algorithm needs to take into account the following subsystems.

Heterogeneous ion channels (macromolecules) on chip: In a patch-clamp chip, the dynamical behavior of the individual ion channels that are activated changes with time, since they can become desensitized through excessive use. Desensitized ion channels behave quite differently from normal ion channels: their transitions to the open state become less frequent.

Patch-clamp amplifier and heterogeneous measurements: The channel current of the activated ion channel is of the order of picoamps and is measured in large amounts of thermal noise. Chung et al. (1990, 1991) used the powerful paradigm of HMMs to characterize these noisy measurements of single ion channel currents. The added complexity in the patch-clamp chip is that the signal-to-noise ratio differs across the chip, meaning that certain ion channels have higher SNR than others.

Ion channel activation scheduler: The ion channel activation scheduler uses the noisy channel current observations of the activated ion channel in the patch-clamp chip to decide which ion channel to activate at the next time instant so as to maximize a reward function that comprises the information obtained from the experiment. It needs to avoid activating desensitized channels, as they yield less information.

4.3.1 Stochastic Dynamical Models for Ion Channels on Patch-Clamp Chip

In this section we formulate a novel Markov chain model for the ion channels that takes into account both the ion channel current state and the ion channel sensitivity. The patch-clamp chip consists of P ion channels arranged in a two-dimensional grid indexed by p = 1, ..., P. Let k = 0, 1, 2, ... denote discrete time. At each time instant k the scheduler decides which single ion channel to activate by directing a laser beam at it, as described above. Let u_k ∈ {1, ..., P} denote the ion channel that is activated by the scheduler at time k. The remaining P − 1 ion channels on the chip are inactive. It is the job of the dynamic scheduler to decide which ion channel should be activated at each time instant k in order to maximize the amount of information that can be obtained from the chip. If channel p is active at time k, i.e., u_k = p, the following two mechanisms determine the evolution of this active ion channel.

Ion Channel Sensitivity Model  The longer the channel is activated, the more probable it is that it becomes desensitized. Let d_k^{(p)} ∈ {normal, de-sens} denote the sensitivity of ion channel p at time instant k. If ion channel p is activated at time

4.3 Scheduling Multiple Ion Channels on a Biological Chip


k, i.e., u_k = p, then d_k^{(p)} can be modeled as a two-state Markov chain on {normal, de-sens} with state transition probability matrix

D = \begin{bmatrix} d_{11} & 1 - d_{11} \\ 0 & 1 \end{bmatrix}, \qquad 0 \le d_{11} \le 1,   (4.10)

where the rows and columns are indexed by {normal, de-sens}.

The above transition probabilities reflect the fact that if the channel is overused, it becomes desensitized with probability d_{12} = 1 − d_{11}. The 0 and 1 in the second row imply that once the channel is desensitized, it remains desensitized. Note that the sensitivity of the inactive channels remains fixed, i.e., d_{k+1}^{(q)} = d_k^{(q)} for q ≠ p.

Ion Channel Current Model  Suppose channel p is active at time k, i.e., u_k = p. Let i_k^{(p)} ∈ {0, I} = {closed, open} denote the channel current. As is well known (Chung et al., 1991), the channel current is a binary-valued signal that switches between zero (the "closed" state) and the current level I (the "open" state). The open-state current level I is of importance to neurobiologists since it quantifies the effect of a drug on the ion channel. Moreover, i_k^{(p)} can be modeled as a two-state Markov chain (Chung et al., 1991), conditional on d_{k+1}^{(p)}, with transition probability matrix Q = ( P(i_{k+1}^{(p)} | i_k^{(p)}, d_{k+1}^{(p)}) ) given by

Q(d_{k+1}^{(p)} = normal) = \begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{bmatrix}, \qquad Q(d_{k+1}^{(p)} = de\text{-}sens) = \begin{bmatrix} \bar{q}_{11} & \bar{q}_{12} \\ \bar{q}_{21} & \bar{q}_{22} \end{bmatrix},   (4.11)

where the rows and columns are indexed by {closed, open}.

For each ion channel p ∈ {1, ..., P} on the patch-clamp chip, define the ion channel state as the vector Markov process s_k^{(p)} = (d_k^{(p)}, i_k^{(p)}) with state space {(normal, closed), (normal, open), (de-sens, closed), (de-sens, open)}, where for notational convenience we map the four states to {1, 2, 3, 4}. It is clear that only the state s_k^{(u_k)} of the ion channel that is activated evolves with time. Since

P(s_{k+1}^{(p)} | s_k^{(p)}) = P(d_{k+1}^{(p)} | d_k^{(p)}) \, P(i_{k+1}^{(p)} | i_k^{(p)}, d_{k+1}^{(p)}),

if channel p is active at time k, i.e., u_k = p, then s_k^{(p)} has transition probability matrix

A^{(p)} = \begin{bmatrix} d_{11} q_{11} & d_{11} q_{12} & (1 - d_{11}) \bar{q}_{11} & (1 - d_{11}) \bar{q}_{12} \\ d_{11} q_{21} & d_{11} q_{22} & (1 - d_{11}) \bar{q}_{21} & (1 - d_{11}) \bar{q}_{22} \\ 0 & 0 & \bar{q}_{11} & \bar{q}_{12} \\ 0 & 0 & \bar{q}_{21} & \bar{q}_{22} \end{bmatrix}.   (4.12)
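Given the factorization above, the matrix A^{(p)} of equation 4.12 can be assembled mechanically from the sensitivity matrix D and the two current-model matrices Q and Q̄. A minimal numpy sketch (the numerical values of d_{11}, Q, and Q̄ are illustrative placeholders, not taken from the text):

```python
import numpy as np

def build_A(d11, Q, Qbar):
    """Assemble the 4x4 transition matrix of s_k = (d_k, i_k) (eq. 4.12).

    State order: (normal,closed), (normal,open), (de-sens,closed), (de-sens,open).
    The channel stays normal w.p. d11 and desensitizes w.p. 1 - d11;
    the current evolves via Q while normal and via Qbar once desensitized.
    """
    d12 = 1.0 - d11
    A = np.zeros((4, 4))
    A[:2, :2] = d11 * Q      # remain normal: current transition from Q
    A[:2, 2:] = d12 * Qbar   # become desensitized: current transition from Qbar
    A[2:, 2:] = Qbar         # de-sens is absorbing
    return A

# Illustrative parameter values (not from the text)
Q = np.array([[0.9, 0.1],
              [0.3, 0.7]])
Qbar = np.array([[0.99, 0.01],
                 [0.80, 0.20]])
A = build_A(0.95, Q, Qbar)
assert np.allclose(A.sum(axis=1), 1.0)  # every row is a probability distribution
```

The zero block in the lower-left corner encodes the absorbing nature of desensitization.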


More generally, one can assume that the state s_k^{(p)} of each ion channel p takes a finite number N_p of values (instead of just four). If u_k = p, the state s_k^{(p)} of ion channel p evolves according to an N_p-state homogeneous Markov chain with transition probability matrix

A^{(p)} = (a_{ij}^{(p)}), \qquad a_{ij}^{(p)} = P(s_{k+1}^{(p)} = j \mid s_k^{(p)} = i), \quad i, j \in \{1, \ldots, N_p\},   (4.13)

if ion channel p is active at time k. The states of all the other P − 1 ion channels that are not activated are unaffected, i.e., s_{k+1}^{(q)} = s_k^{(q)}, q ≠ p. To complete our probabilistic formulation, assume that the initial states of all ion channels on the chip are initialized with prior distributions: s_0^{(p)} ∼ x_0^{(p)}, where the x_0^{(p)} are specified initial distributions for p = 1, ..., P. The above formulation captures the essence of an activation-controlled patch-clamp chip: the channel activation scheduler dynamically decides which single ion channel to activate at each time instant.

4.3.2 Patch-Clamp Amplifier and Hidden Markov Model Measurements

The state s_k^{(p)} of the active ion channel on the chip is not directly observed. Instead, the output of the patch-clamp amplifier is the ion channel current i_k^{(p)} observed in large amounts of thermal noise. This output is quantized to an M-symbol alphabet, y_k^{(p)} ∈ {O_1, O_2, ..., O_M}. The probabilistic relationship between the observations y_k^{(p)} and the actual ion channel state s_k^{(p)} of the active ion channel p is summarized by the (N_p × M) state likelihood matrix

B^{(p)} = (b_{im}^{(p)}), \qquad i \in \{1, \ldots, N_p\}, \; m \in \{1, \ldots, M\},   (4.14)

where b_{im}^{(p)} = P(y_{k+1}^{(p)} = O_m | s_{k+1}^{(p)} = i, u_k = p) denotes the conditional probability (symbol probability) of the observation symbol y_{k+1}^{(p)} = O_m when the actual state is s_{k+1}^{(p)} = i and the active ion channel is u_k = p. Note that the above model allows the state likelihood probabilities (b_{im}^{(p)}) to vary with p, i.e., with the spatial location of the ion channel on the patch-clamp chip, thus allowing for spatially heterogeneous measurement statistics.

Let Y_k = (y_1^{(u_0)}, ..., y_k^{(u_{k-1})}) denote the observed history up to time k. Let U_k = (u_0, ..., u_k) denote the sequence of past decisions made by the ion channel activation scheduler regarding which ion channels to activate from time 0 to time k.
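Given B^{(p)}, generating a quantized amplifier observation for the active channel is a single categorical draw from the row of B^{(p)} indexed by the current state. A small sketch (the likelihood values are illustrative placeholders, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 4-state, 3-symbol likelihood matrix B^(p): rows are the states
# (normal,closed), (normal,open), (de-sens,closed), (de-sens,open);
# columns are the quantized symbols O_1, O_2, O_3.
B = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.20, 0.70],
              [0.85, 0.10, 0.05],
              [0.30, 0.40, 0.30]])

def sample_observation(state, B, rng):
    """Draw the symbol index of y_{k+1} given s_{k+1} = state (eq. 4.14)."""
    return rng.choice(B.shape[1], p=B[state])

y = sample_observation(1, B, rng)  # active channel in state (normal, open)
```

Spatial heterogeneity of the chip corresponds simply to holding a different matrix B for each p.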

4.3.3 Ion Channel Activation Scheduler

The above probabilistic model for the ion channel, together with the noisy measurements from the patch-clamp amplifier, constitutes a well-known type of dynamic Bayesian network called a hidden Markov model (HMM) (Ephraim and Merhav, 2002). The problem of state inference in an HMM, i.e., estimating the state s_k^{(p)} given


(Y_k, U_k), has been widely studied (see, e.g., Chung et al. (1991); Ephraim and Merhav (2002)). In this chapter we address the deeper and more fundamental issue of how the ion channel activation scheduler should dynamically decide which ion channel to activate at each time instant in order to minimize a suitable cost function that encompasses all the ion channels. Such dynamic decision making under uncertainty (noisy channel current measurements) transcends standard sensor-level HMM state inference, which is a well-studied problem (Chung et al., 1991). The activation scheduler decides which ion channel to activate at time k based on the optimization of a discounted cost function, which we now detail. The instantaneous cost incurred at time k due to all the ion channels (both active and inactive) is

C_k = -c_0(u_k) + c(s_k^{(u_k)}, u_k) + \sum_{p \neq u_k} r(s_k^{(p)}, p),   (4.15)

where −c_0(u_k) + c(s_k^{(u_k)}, u_k) denotes the cost incurred by the active ion channel u_k, and \sum_{p \neq u_k} r(s_k^{(p)}, p) denotes the cost of the remaining P − 1 inactive ion channels. The three components in the above cost function (equation 4.15) can be chosen by the neurobiologist experimenter to optimize the information obtained from the patch-clamp experiment. Here we present one possible choice of costs.

Ion channel quality of service (QoS): c_0(p) denotes the quality of service of the active ion channel p. The minus sign in equation 4.15 reflects the fact that the lower the QoS, the higher the cost, and vice versa.

State information cost: The final outcome of the patch-clamp experiment is often the estimate of the open-state level I. The accuracy of this estimate increases linearly with the number of observations obtained in the open state (since the error covariance of the estimate decreases linearly with the data length according to the central limit theorem). Maximizing the accuracy of I requires maximizing the utilization of the patch-clamp chip, i.e., maximizing the expected number of measurements made from ion channels that are in the open normal state. That is, preference should be given to activating ion channels that are normal (i.e., not desensitized) and that switch to the open state quickly compared with other ion channels.

Desensitization cost of inactive channels: The instantaneous cost r(s_k^{(p)}, p) in equation 4.15 incurred by each of the P − 1 inactive ion channels p ∈ {1, 2, ..., P} − {u_k} should be chosen so as to penalize desensitized channels.

Based on the observed history Y_k = (y_1^{(u_0)}, ..., y_k^{(u_{k-1})}) and the history of decisions U_{k-1} = (u_0, ..., u_{k-1}), the scheduler decides which ion channel on the chip to activate at time k according to a stationary policy μ : (Y_k, U_{k-1}) → u_k.

where −c0 (uk ) + c(sk k , uk ) denotes the cost incurred by the active ion channel uk , & (p) and p=uk r(sk , p) denotes the cost of remaining P − 1 inactive ion channels. The three components in the above cost function 4.15, can be chosen by the neurobiologist experimenter to optimize the information obtained from the patchclamp experiment. Here we present one possible choice of costs: Ion channel quality of service (QoS): c0 (p) denotes the quality of service of the active ion channel p. The minus signs in equation 4.15 reﬂects the fact that the lower the QoS the higher the cost and vice versa. State information cost: The ﬁnal outcome of the patch-clamp experiment is often the estimate of the open-state level I. The accuracy of this estimate increases linearly with the number of observations obtained in the open state (since the covariance error of the estimate decreases linearly with the data length according to the central limit theorem). Maximizing the accuracy I requires maximizing the utilization of the patch-clamp chip, i.e., maximizing the expected number of measurements made from ion channels that are in the open normal state. That is, preference should be given to activating ion channels that are normal (i.e., not desensitized) and that quickly switch to the open state compared to other ion channels. (p) Desensitization cost of inactive channels: The instantaneous cost r(sk , p) in equation 4.15 incurred by each of the P −1 inactive ion channels p ∈ {1, 2, . . . , P }−{uk } should be chosen so as to penalize desensitized channels. (u ) (u ) Based on the observed history Yk = (y1 0 , . . . , yk k−1 ), and the history of decisions Uk−1 = (u0 , . . . , uk−1 ), the scheduler needs to decide which ion channel on the chip to activate at time k. The scheduler decides which ion channel to activate at time k based on the stationary policy μ : (Yk , Uk−1 ) → uk . 
Here μ is a function that maps the observation history Yk and past decisions Uk−1 to the choice of which ion channel uk to activate at time k. Let U denote the class of admissible stationary policies, i.e., U = {μ : uk = μ(Yk , Uk−1 )}. The total expected discounted

92

Sensor Adaptive Signal Processing of Biological Nanotubes

reward over an inﬁnite time horizon is given by ( '∞ k Jμ = E β Ck ,

(4.16)

k=0

inﬁnite horizon discounted cost POMDP

where C_k is defined in equation 4.15, β ∈ [0, 1) denotes the discount factor, and E{·} denotes mathematical expectation. The aim of the scheduler is to determine the optimal stationary policy μ* ∈ U that minimizes the cost in equation 4.16. The above problem of minimizing the infinite-horizon discounted cost (equation 4.16) of the stochastic dynamical system (equation 4.13) with noisy observations (equation 4.14) is a partially observed Markov decision process (POMDP). Developing numerically efficient ion channel activation scheduling algorithms to minimize this cost is the subject of the rest of this section.
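The discounted cost in equation 4.16 can be approximated on a sample path by truncating the infinite horizon; since β^k decays geometrically, the truncation error is easily bounded. A minimal sketch (the cost sequence here is synthetic, not generated by the chip model):

```python
import numpy as np

def discounted_cost(costs, beta):
    """Realized sum_k beta^k C_k over a finite truncation of eq. 4.16."""
    return sum(beta ** k * c for k, c in enumerate(costs))

# Sanity check against the geometric series: a constant cost C_k = 1 over K
# steps gives (1 - beta^K) / (1 - beta).
beta, K = 0.9, 200
J = discounted_cost(np.ones(K), beta)
assert abs(J - (1.0 - beta ** K) / (1.0 - beta)) < 1e-9
```

In practice J_μ would be estimated by averaging such truncated sums over many simulated sample paths of the controlled chip.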

4.3.4 Formulation of Activation Scheduling as a Multiarmed Bandit

The above stochastic control problem (equation 4.16) is an infinite-horizon partially observed Markov decision process with a multiarmed bandit structure, which considerably simplifies the solution. But first, as is standard with partially observed stochastic control problems, we convert the partially observed multiarmed bandit problem into a fully observed multiarmed bandit problem defined in terms of the information state (Bertsekas, 1995a).

information state

HMM ﬁlter

4.3.5 Information State Formulation

For each ion channel p, the information state at time k, which we will denote by x_k^{(p)} (a column vector of dimension N_p), is defined as the conditional filtered density of the Markov chain state s_k^{(p)} given Y_k and U_{k-1}:

x_k^{(p)}(i) = P(s_k^{(p)} = i \mid Y_k, U_{k-1}), \qquad i = 1, \ldots, N_p.   (4.17)

The information state can be computed recursively by the HMM state filter, also known as the forward algorithm or Baum's algorithm (James et al., 1996), according to equation 4.18 below. In terms of the information state formulation, the ion channel activation scheduling problem described above can be viewed as the following dynamic scheduling problem. Consider P parallel HMM state filters, one for each ion channel on the chip. The pth HMM filter computes the state estimate (filtered density) x_k^{(p)} of the pth ion channel, p ∈ {1, ..., P}. At each time instant, only one of the P ion channels is active, say ion channel p, resulting in an observation y_{k+1}^{(p)}. This is processed by the pth HMM state filter, which updates its Bayesian estimate of the ion channel's state as

x_{k+1}^{(p)} = \frac{B^{(p)}(y_{k+1}^{(p)}) \, A^{(p)\prime} x_k^{(p)}}{\mathbf{1}^\prime B^{(p)}(y_{k+1}^{(p)}) \, A^{(p)\prime} x_k^{(p)}} \quad \text{if ion channel } p \text{ is active},   (4.18)


where if y_{k+1}^{(p)} = O_m, then B^{(p)}(m) = diag[b_{1m}^{(p)}, ..., b_{N_p m}^{(p)}] is the diagonal matrix formed by the mth column of the observation matrix B^{(p)}, and 1 is an N_p-dimensional column vector of ones (we use ′ to denote transpose). The state estimates of the other P − 1 HMM state filters remain unaffected, i.e., if ion channel q is inactive,

x_{k+1}^{(q)} = x_k^{(q)}, \qquad q \in \{1, \ldots, P\}, \; q \neq p.   (4.19)
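The filter update of equation 4.18, together with the frozen estimates of equation 4.19, can be written in a few lines of numpy. The model matrices below are illustrative two-state placeholders, not parameters from the text:

```python
import numpy as np

def hmm_filter_update(x, A, B, y):
    """One step of eq. 4.18: x_{k+1} = B(y) A' x_k / (1' B(y) A' x_k)."""
    unnormalized = np.diag(B[:, y]) @ A.T @ x
    return unnormalized / unnormalized.sum()

# Illustrative 2-state transition and observation matrices (placeholders)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])

# Two channels, both starting from a uniform prior; channel 0 is active.
estimates = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
estimates[0] = hmm_filter_update(estimates[0], A, B, y=1)  # eq. 4.18
# eq. 4.19: the inactive channel's estimate stays frozen
assert np.allclose(estimates[1], [0.5, 0.5])
```

The denominator is the one-step predictive likelihood of the observed symbol, so each update keeps the estimate on the probability simplex.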

Let X^{(p)} denote the state space of information states x^{(p)} for ion channel p ∈ {1, 2, ..., P}. That is,

X^{(p)} = \{ x^{(p)} \in \mathbb{R}^{N_p} : \mathbf{1}^\prime x^{(p)} = 1, \; 0 < x^{(p)}(i) < 1 \text{ for all } i \in \{1, \ldots, N_p\} \}.   (4.20)

Note that X^{(p)} is an (N_p − 1)-dimensional simplex. Using the smoothing property of conditional expectations, the cost function 4.16 can be rewritten in terms of the information state as

J_\mu = E \left\{ \sum_{k=0}^{\infty} \beta^k \Big[ c^\prime(u_k) \, x_k^{(u_k)} + \sum_{p \neq u_k} r^\prime(p) \, x_k^{(p)} \Big] \right\},   (4.21)

on-going multiarmed bandit

where c(u_k) denotes the N_{u_k}-dimensional cost vector [c(s_k^{(u_k)} = 1, u_k), ..., c(s_k^{(u_k)} = N_{u_k}, u_k)]′, and r(p) denotes the N_p-dimensional cost vector [r(s_k^{(p)} = 1, p), ..., r(s_k^{(p)} = N_p, p)]′. The aim is to compute the optimal policy arg min_{μ∈U} J_μ. In terms of equations 4.18 and 4.21, the multiarmed bandit problem reads thus: design an optimal dynamic scheduling policy to choose which ion channel to activate, and hence which HMM Bayesian state estimator to use, at each time instant. As it stands, the POMDP problem of equations 4.18, 4.19, and 4.21, or equivalently of equations 4.16, 4.13, and 4.14, has a special structure: (1) Only one Bayesian HMM state estimator operates according to equation 4.18 at each time k; equivalently, only one ion channel is active at a given time k. The remaining P − 1 Bayesian estimates x_k^{(q)} remain frozen; equivalently, the remaining P − 1 ion channels remain inactive. (2) The active ion channel incurs a cost depending on its current state and QoS. Since the state estimates of the inactive ion channels are frozen, the cost incurred by them is a fixed constant depending on the state when they were last active. These two properties imply that equations 4.18, 4.19, and 4.21 constitute what Gittins (1989) terms an ongoing multiarmed bandit. It turns out that by a straightforward transformation an ongoing bandit can be formulated as a standard multiarmed bandit. It is well known that the multiarmed bandit problem has a rich structure, which results in the ion channel activation scheduling problem decoupling into P independent optimization problems. Indeed, from the theory of multiarmed bandits it follows that the optimal scheduling policy has an indexable rule (Whittle, 1980): for each channel p there is a function γ^{(p)}(x_k^{(p)}), called the Gittins index, which is a function only of the ion channel p and its information state x_k^{(p)}, whereby the


optimal ion channel activation policy at time k is to activate the ion channel with the largest Gittins index, i.e.,

activate ion channel q, \quad \text{where } q = \arg\max_{p \in \{1, \ldots, P\}} \gamma^{(p)}(x_k^{(p)}).   (4.22)

A proof of this index rule for general multiarmed bandit problems is given by Whittle (1980). Computing the Gittins index is a key requirement for devising an optimal activation policy for the patch-clamp chip. We refer the reader to Krishnamurthy (2004) for details on how the Gittins index is computed and for numerical examples of the performance of the algorithm.

Remarks: The indexable structure of the optimal ion channel activation policy (equation 4.22) is convenient for two reasons: (1) Scalability: Since the Gittins index is computed for each ion channel independently of every other ion channel (and this computation is off-line), the ion channel activation problem is easily scalable: we can handle several hundred ion channels on a chip. In contrast, without taking the multiarmed bandit structure into account, the POMDP has N_p^P underlying states, making it computationally impossible to solve; e.g., for P = 50 channels with N_p = 2 states per channel, there are 2^50 states! (2) Suitability for heterogeneous ion channels: Notice that our formulation allows the ion channels to have different transition probabilities and likelihood probabilities. Moreover, since the Gittins index of an ion channel does not depend on other ion channels, we can meaningfully compare different types of ion channels.
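The index rule of equation 4.22 makes the scheduler itself trivial once per-channel indices are available: evaluate each channel's index on its own information state and activate the arg max. The sketch below uses a hypothetical surrogate index that simply favors channels whose filtered state is normal and open; the true Gittins index computation is described in Krishnamurthy (2004):

```python
import numpy as np

def surrogate_index(x):
    """Hypothetical stand-in for the Gittins index gamma^(p)(x).

    State order: (normal,closed), (normal,open), (de-sens,closed), (de-sens,open).
    The real index would be computed off-line per channel (Krishnamurthy, 2004).
    """
    return x[0] + 2.0 * x[1]

def schedule(info_states):
    """Eq. 4.22: activate the channel with the largest index."""
    return int(np.argmax([surrogate_index(x) for x in info_states]))

x1 = np.array([0.10, 0.80, 0.05, 0.05])  # probably normal and open
x2 = np.array([0.10, 0.10, 0.40, 0.40])  # probably desensitized
assert schedule([x1, x2]) == 0
```

Because each index depends only on that channel's own information state, the per-step cost of scheduling grows linearly in P.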

4.4 The Permeation Problem: Brownian Stochastic Dynamical Formulation

gramicidin-A

In the previous section we dealt with ion channels at a macroscopic level, in both the spatial and time scales. The permeation problem considered in this and the following two sections seeks to explain the working of an ion channel at an Å (angstrom unit = 10^{-10} m) spatial scale by studying the propagation of individual ions through the ion channel at a femtosecond (10^{-15} s) timescale. This setup is said to be at a mesoscopic scale since the individual ions (e.g., Na+ ions) are of the order of a few Å in radius and are comparable in radius to the ion channel. At this mesoscopic level, point charge approximations and continuum electrostatics break down. The discrete, finite nature of each ion needs to be taken into consideration. Also, the failure of the mean field approximation in narrow channels implies that any theory that aspires to relate channel structure to channel function must treat ions explicitly.

For convenience we focus in this section primarily on gramicidin-A channels, which are among the simplest ion channels. Gramicidin-A is an antibiotic produced by Bacillus brevis. It was one of the first antibiotics to be isolated, in the 1940s (Finkelstein, 1987, p. 130). In submicromolar concentrations it can increase the


conductance of a bacterial cell membrane (which is a planar lipid bilayer membrane) by more than seven orders of magnitude through the formation of cation-selective channels. As a result the bacterial cell is flooded and dies. This property of dramatically increasing the conductance of a lipid bilayer membrane has recently been exploited by Cornell et al. (1997) to devise gramicidin-A-channel-based biosensors with extremely high gains.

The aim of this section and the following two sections is to develop a stochastic dynamical formulation of the permeation problem that ultimately leads to estimating a potential of mean force (PMF) profile for an ion channel by optimizing the fit between the simulated current and the experimentally observed current. In the mesoscopic simulation of an ion channel, we propagate each individual ion using Brownian dynamics (the Langevin equation), and the force experienced by each ion is a function of the PMF. As a result of the PMF and the external potential applied to the ion channel, ions drift from outside the cell to inside via the ion channel, producing the simulated current. Determining the PMF profile that optimizes the fit between the mesoscopic simulated current and the observed current yields useful information and insight into how an ion channel works at a mesoscopic level.

Determining the optimal PMF profile is important for several reasons. First, it yields the effective charge density in the peptides that form the ion channel. This charge density yields insight into the crystal structure of the peptide. Second, for theoretical biophysicists, the PMF profile yields information about the permeation dynamics, including where an ion is likely to be trapped (the binding sites), the mean velocity of propagation of ions through the channel, and the average conductance of the ion channel.
We refer the reader to Krishnamurthy and Chung (a,b) for complete details of the Brownian dynamics algorithm and of adaptively controlled Brownian dynamics algorithms for estimating the PMF of ion channels. The tutorial paper by Krishnamurthy and Chung (2005) and the references therein also give a detailed overview of Brownian dynamics simulation for determining the structure of ion channels.

molecular dynamics vs. Brownian dynamics

4.4.1 Levels of Abstraction for Modeling Ion Channels at the Nanoscale

The ultimate aim of theoretical biophysicists is to provide a comprehensive physical description of biological ion channels. At the lowest level of abstraction is the ab initio quantum mechanical approach, in which the interactions between the atoms are determined from first-principles electronic structure calculations. Owing to the extremely demanding nature of the computations, its applications are at present limited to very small systems. A higher level of modeling abstraction is classical molecular dynamics. Here, simulations are carried out using empirically determined pairwise interaction potentials between the atoms, via ordinary differential equations (Newton's equations of motion). However, it is not computationally feasible to simulate the ion channel long enough to observe the permeation of ions across a model channel. For that purpose, one has to go up one further step in abstraction to


ion channel vs. carbon nanotube


stochastic dynamics, of which Brownian dynamics (BD) is the simplest form, in which the water molecules that form the bulk of the system in ion channels are stochastically averaged and only the ions themselves are explicitly simulated. Thus, instead of considering the dynamics of individual water molecules, one considers their average effect as a random force, or Brownian motion, acting on the ions. This treatment of water molecules can be viewed as a functional central limit theorem approximation. In BD, it is further assumed that the protein is rigid. Thus, in BD, the motion of each individual ion is modeled by a stochastic differential equation known as the Langevin equation. A still higher level of abstraction is the Poisson-Nernst-Planck (PNP) theory, which is based on the continuum hypothesis of electrostatics and the mean-field approximation. Here, ions are treated not as discrete entities but as continuous charge densities that represent the space-time average of the microscopic motion of ions. For narrow ion channels, where continuum electrostatics does not hold, the PNP theory does not adequately explain ion permeation.

Remark: Bio-Nanotube Ion Channel vs. Carbon Nanotube  There has recently been much work in the nanotechnology literature on carbon nanotubes and their use in field effect transistors (FETs). BD ion channel models are more complex than those of a carbon nanotube. Biological ion channels have radii of between 2 Å and 6 Å. In these narrow conduits formed by the protein wall, the force impinging on a permeating ion from induced surface charges on the water-protein interface becomes a significant factor. This force is insignificant in the carbon nanotubes used in FETs, which have radii of approximately 100 Å, large compared with the Debye length of electrons or holes in Si.
Thus the key difference is that while in carbon nanotubes point charge approximations and continuum electrostatics hold, in ion channels the discrete, finite nature of each ion needs to be considered.

4.4.2 Brownian Dynamics (BD) Simulation Setup

Figure 4.4 illustrates the schematic setup of a Brownian dynamics simulation for the permeation of ions through an ion channel. The aim is to obtain structural information, i.e., to determine the channel geometry and the charges in the protein that forms the ion channel. Figure 4.4 shows a schematic illustration of a BD simulation assembly for a particular example of an antibiotic ion channel, the gramicidin-A ion channel. The ion channel is placed at the center of the assembly. The atoms forming the ion channel are represented as a homogeneous medium with a dielectric constant of 2. Then, a large reservoir with a fixed number of positive ions (e.g., K+ or Na+ ions) and negative ions (e.g., Cl− ions) is attached at each end of the ion channel. The electrolyte in the two reservoirs comprises 55 M H2O and 150 mM concentrations of Na+ and Cl− ions.

4.4

The Permeation Problem: Brownian Stochastic Dynamical Formulation

97

Figure 4.4  Gramicidin-A ion channel model comprising 2N ions within two cylindrical reservoirs R1, R2, connected by the ion channel C.


4.4.3 Mesoscopic Permeation Model of Ion Channel

Our permeation model for the ion channel comprises two cylindrical reservoirs R1 and R2 connected by the ion channel C, as depicted in fig. 4.4, in which 2N ions are inserted (N denotes a positive integer). In fig. 4.4, as an example, we have chosen a gramicidin-A antibiotic ion channel, although the results below hold for any ion channel. These 2N ions comprise:

(1) N positively charged ions indexed by i = 1, 2, ..., N. Of these, the N/2 ions indexed by i = 1, ..., N/2 are in R1, and the N/2 ions indexed by i = N/2 + 1, ..., N are in R2. Each Na+ ion has charge q^{(i)} = q^+, mass m^{(i)} = m^+ = 3.8 × 10^{-26} kg, frictional coefficient m^+ γ^+, and radius r^+.

(2) N negatively charged ions indexed by i = N + 1, N + 2, ..., 2N. Of these, the N/2 ions indexed by i = N + 1, ..., 3N/2 are placed in R1 and the remaining N/2 ions indexed by i = 3N/2 + 1, ..., 2N are placed in R2. Each negative ion has charge q^{(i)} = q^−, mass m^{(i)} = m^−, frictional coefficient m^− γ^−, and radius r^−.

R = R1 ∪ R2 ∪ C denotes the set comprising the interior of the reservoirs and the ion channel. Let t ≥ 0 denote continuous time. Each ion i moves in three-dimensional space over time. Let x_t^{(i)} = (x_t^{(i)}, y_t^{(i)}, z_t^{(i)}) ∈ R and v_t^{(i)} ∈ ℝ^3 denote the position and velocity of ion i at time t. The three components x_t^{(i)}, y_t^{(i)}, z_t^{(i)} of x_t^{(i)} ∈ R are, respectively, the x, y, and z position coordinates. An external potential Φ_λ^{ext}(x) is applied along the z-axis of fig. 4.4, i.e., with x = (x, y, z), Φ_λ^{ext}(x) = λz, λ ∈ Λ. Here Λ denotes a finite set of applied potentials; typically Λ = {−200, −180, ..., 0, ..., 180, 200} mV/m. Due to this applied external potential, the Na+ ions drift from reservoir R1 to R2 via the ion channel C in fig. 4.4. Let X_t = (x_t^{(1)}, x_t^{(2)}, ..., x_t^{(2N)}) ∈ R^{2N} and V_t = (v_t^{(1)}, v_t^{(2)}, ..., v_t^{(2N)}) ∈ ℝ^{6N} denote the positions and velocities of all 2N ions.
The position and velocity of each individual ion evolve according to the following continuous-time stochastic dynamical system:

x_t^{(i)} = x_0^{(i)} + \int_0^t v_s^{(i)} \, ds,   (4.23)

m^+ v_t^{(i)} = m^+ v_0^{(i)} - \int_0^t m^+ \gamma^+(x_s^{(i)}) v_s^{(i)} \, ds + \int_0^t F_{\theta,\lambda}^{(i)}(X_s) \, ds + b^+ w_t^{(i)}, \qquad i \in \{1, 2, \ldots, N\},   (4.24)

m^- v_t^{(i)} = m^- v_0^{(i)} - \int_0^t m^- \gamma^-(x_s^{(i)}) v_s^{(i)} \, ds + \int_0^t F_{\theta,\lambda}^{(i)}(X_s) \, ds + b^- w_t^{(i)}, \qquad i \in \{N+1, N+2, \ldots, 2N\}.   (4.25)

Langevin equation

Equations 4.24 and 4.25 constitute the well-known Langevin equations and describe the evolution of the velocity v_t^{(i)} of ion i as a stochastic dynamical system. The random process {w_t^{(i)}} denotes a three-dimensional Brownian motion, which is component-wise independent. The constants b^+ and b^− are given, respectively, by (b^+)^2 = 2 m^+ γ^+ kT and (b^−)^2 = 2 m^− γ^− kT. Finally, the noise processes {w_t^{(i)}} and {w_t^{(j)}} that drive any two different ions, j ≠ i, are assumed to be statistically independent.
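Sample paths of equations 4.23 and 4.24 can be generated with a simple Euler-Maruyama discretization. The sketch below propagates a single ion in one dimension under an illustrative harmonic PMF; all constants are placeholders in reduced units, not the physical gramicidin-A values:

```python
import numpy as np

# Illustrative constants in reduced units (placeholders, not gramicidin-A values)
m, gamma, kT = 1.0, 2.0, 1.0
b = np.sqrt(2.0 * m * gamma * kT)  # fluctuation-dissipation: b^2 = 2 m gamma kT
force = lambda x: -x               # F = -dU/dx for a harmonic PMF U(x) = x^2 / 2

rng = np.random.default_rng(1)
dt, steps = 1e-3, 20000
x, v = 0.0, 0.0
for _ in range(steps):
    dw = rng.normal(0.0, np.sqrt(dt))                     # Brownian increment
    v += (-gamma * v + force(x) / m) * dt + (b / m) * dw  # eq. 4.24 discretized
    x += v * dt                                           # eq. 4.23 discretized
```

In the full 3D, 2N-ion simulation, the same update is applied to every ion, with the force term evaluated from the total potential described in section 4.4.4.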


In equations 4.24 and 4.25, F_{θ,λ}^{(i)}(X_t) = −q^{(i)} ∇_{x_t^{(i)}} Φ_{θ,λ}^{(i)}(X_t) represents the systematic force acting on ion i, where the scalar-valued process Φ_{θ,λ}^{(i)}(X_t) is the total electric potential experienced by ion i given the position X_t of the 2N ions. The subscript λ is the applied external potential. The subscript θ is a parameter that characterizes the potential of mean force (PMF) profile, which is an important component of Φ_{θ,λ}^{(i)}(X_t).

It is convenient to represent the above system (equations 4.23, 4.24, and 4.25) as a vector stochastic differential equation. Define the following vector-valued variables:

\zeta_t = \begin{bmatrix} X_t \\ V_t \end{bmatrix}, \quad V_t = \begin{bmatrix} V_t^+ \\ V_t^- \end{bmatrix}, \quad V_t^+ = \begin{bmatrix} v_t^{(1)} \\ \vdots \\ v_t^{(N)} \end{bmatrix}, \quad V_t^- = \begin{bmatrix} v_t^{(N+1)} \\ \vdots \\ v_t^{(2N)} \end{bmatrix}, \quad w_t = \begin{bmatrix} 0_{6N \times 1} \\ w_t^{(1)} \\ \vdots \\ w_t^{(2N)} \end{bmatrix},

F_{\theta,\lambda}^+(X_t) = \begin{bmatrix} F_{\theta,\lambda}^{(1)}(X_t) \\ \vdots \\ F_{\theta,\lambda}^{(N)}(X_t) \end{bmatrix}, \quad F_{\theta,\lambda}^-(X_t) = \begin{bmatrix} F_{\theta,\lambda}^{(N+1)}(X_t) \\ \vdots \\ F_{\theta,\lambda}^{(2N)}(X_t) \end{bmatrix}, \quad F_{\theta,\lambda}(X_t) = \begin{bmatrix} \frac{1}{m^+} F_{\theta,\lambda}^+(X_t) \\ \frac{1}{m^-} F_{\theta,\lambda}^-(X_t) \end{bmatrix}.   (4.26)

Then equations 4.23, 4.24, and 4.25 can be written compactly as

d\zeta_t = A \zeta_t \, dt + f_{\theta,\lambda}(\zeta_t) \, dt + \Sigma^{1/2} \, dw_t,   (4.27)

where

\Sigma^{1/2} = \mathrm{blockdiag}\big( 0_{6N \times 6N}, \; (b^+/m^+) I_{3N \times 3N}, \; (b^-/m^-) I_{3N \times 3N} \big),

A = \begin{bmatrix} 0_{6N \times 6N} & I_{6N \times 6N} \\ 0_{6N \times 6N} & \mathrm{blockdiag}(-\gamma^+ I_{3N \times 3N}, \, -\gamma^- I_{3N \times 3N}) \end{bmatrix}, \qquad f_{\theta,\lambda}(\zeta_t) = \begin{bmatrix} 0_{6N \times 1} \\ F_{\theta,\lambda}(X_t) \end{bmatrix}.   (4.28)
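The block structure of equations 4.27 and 4.28 is easy to get wrong dimensionally; a short numpy sketch that assembles the drift matrix A for a small illustrative N and checks its shape (the values of γ^+ and γ^− are placeholders):

```python
import numpy as np

N = 2                # illustrative: N positive and N negative ions
gp, gm = 1.5, 2.5    # gamma^+ and gamma^- (placeholder values)

friction = np.zeros((6 * N, 6 * N))
friction[:3 * N, :3 * N] = -gp * np.eye(3 * N)   # positive ions
friction[3 * N:, 3 * N:] = -gm * np.eye(3 * N)   # negative ions

# Drift matrix A of eq. 4.28: positions integrate velocities,
# velocities decay through friction; zeta_t = (X_t, V_t) has dimension 12N.
A = np.block([[np.zeros((6 * N, 6 * N)), np.eye(6 * N)],
              [np.zeros((6 * N, 6 * N)), friction]])
assert A.shape == (12 * N, 12 * N)
```

The corresponding Σ^{1/2} has a zero upper block, reflecting that the Brownian forcing enters only through the velocity equations.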

We will subsequently refer to equations 4.27 and 4.28 as the Brownian dynamics equations for the ion channel.

Remark: The BD approach is a stochastic averaging theory framework that models the average effect of the water molecules:

1. The friction term m γ v_t^{(i)} dt captures the average effect of the ions, driven by the applied external electrical field, bumping into water molecules every few femtoseconds. The frictional coefficient is given by Einstein's relation.

2. The Brownian motion term w_t^{(i)} also captures the effect of the random motion of ions bumping into water molecules and is given by the fluctuation-dissipation theorem.


4.4.4 Systematic Force Acting on Ions

As mentioned after equation 4.25, the systematic force experienced by ion i is

F_{\theta,\lambda}^{(i)}(X_t) = -q^{(i)} \nabla_{x_t^{(i)}} \Phi_{\theta,\lambda}^{(i)}(X_t),

where the scalar-valued process Φ_{θ,λ}^{(i)}(X_t) denotes the total electric potential experienced by ion i given the position X_t of all 2N ions. We now give a detailed formulation of these systematic forces. The potential Φ_{θ,λ}^{(i)}(X_t) experienced by each ion i comprises the following five components:

\Phi_{\theta,\lambda}^{(i)}(X_t) = U_\theta(x_t^{(i)}) + \Phi_\lambda^{ext}(x_t^{(i)}) + \Phi^{IW}(x_t^{(i)}) + \Phi^{C,i}(X_t) + \Phi^{SR,i}(X_t).   (4.29)
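Since each force term is the negative gradient of the corresponding potential in equation 4.29, any smooth potential model can be turned into a force by (or checked against) central finite differences. A sketch with an illustrative quadratic potential, not the actual channel PMF:

```python
import numpy as np

def force(phi, x, q, h=1e-6):
    """F = -q * grad(phi) at x, approximated by central finite differences."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        g[j] = (phi(x + e) - phi(x - e)) / (2.0 * h)
    return -q * g

phi = lambda x: 0.5 * np.dot(x, x)   # illustrative potential; grad(phi) = x
x = np.array([1.0, -2.0, 0.5])
F = force(phi, x, q=2.0)
assert np.allclose(F, -2.0 * x, atol=1e-5)
```

In a BD code the analytic gradients of each of the five components would normally be used; the finite-difference form is mainly useful as a correctness check.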

Just as Φ_{θ,λ}^{(i)}(X_t) is decomposed into five terms, we can similarly decompose the force F_{θ,λ}^{(i)}(X_t) = −q^{(i)} ∇_{x_t^{(i)}} Φ_{θ,λ}^{(i)}(X_t) experienced by ion i as the superposition (vector sum) of five force terms, where each force term is due to the corresponding potential in equation 4.29; however, for notational simplicity we describe the scalar-valued potentials rather than the vector-valued forces. Note that the first three terms in equation 4.29, namely U_θ(x_t^{(i)}), Φ_λ^{ext}(x_t^{(i)}), and Φ^{IW}(x_t^{(i)}), depend only on the position x_t^{(i)} of ion i, whereas the last two terms, Φ^{C,i}(X_t) and Φ^{SR,i}(X_t), depend on the distance of ion i to all the other ions, i.e., on the position X_t of all the ions. The five components in equation 4.29 are now defined.

Potential of mean force (PMF), denoted U_θ(x_t^{(i)}) in equation 4.29, comprises the electric forces acting on ion i when it is in or near the ion channel (nanotube C in fig. 4.4). The PMF U_θ is a smooth function of the ion position x_t^{(i)} and depends on the structure of the ion channel; therefore, estimating U_θ(·) yields structural information about the ion channel. In section 4.6, we outline an adaptive Brownian dynamics approach to estimate the PMF U_θ(·). The PMF U_θ originates from two different sources; see Krishnamurthy and Chung (2005) for details. First, there are fixed charges in the channel protein, and the electric field emanating from them renders the pore attractive to cations and repulsive to anions, or vice versa. Some of the amino acids forming the ion channels carry unit or partial electronic charges. For example, glutamate and aspartate are acidic amino acids, being negatively charged at pH 6.0, whereas lysine, arginine, and histidine are basic amino acids, being positively charged at pH 6.0. Second, when any of the ions in the assembly comes near the protein wall, it induces surface charges of the same polarity at the water-protein interface. This is known as the induced surface charge.

External applied potential: In the vicinity of the cell, there is a strong electric field resulting from the membrane potential, which is generated by diffuse, unpaired ionic clouds on each side of the membrane. Typically, this resting potential across a cell membrane, whose thickness is about 50 Å, is 70 mV, the cell interior being negative


with respect to the extracellular space. In simulations, this field is mimicked by applying a uniform electric field across the channel. This is equivalent to placing a pair of large plates far away from the channel and applying a potential difference between the two plates. Because the space between the electrodes is filled with electrolyte solutions, each reservoir is isopotential: the average potential anywhere in a reservoir is identical to the applied potential at the voltage plate on that side. For ion $i$ at position $x_t^{(i)} = x = (x, y, z)'$, $\Phi_\lambda^{\mathrm{ext}}(x) = \lambda z$ denotes the potential on ion $i$ due to the applied external field. The electric field acting on each ion due to the applied potential is therefore $-\nabla_{x_t^{(i)}}\Phi_\lambda^{\mathrm{ext}}(x) = (0, 0, -\lambda)'$ V/m for all $x \in \mathbb{R}^3$. It is this applied external field that causes a drift of ions from reservoir R1 to R2 via the ion channel C. As a result of this drift of ions within the electrolyte in the two reservoirs, eventually the measured potential drop across the reservoirs is zero and all the potential drop occurs across the ion channel.

Inter-ion Coulomb potential: In equation 4.29, $\Phi^{C,i}(X_t)$ denotes the Coulomb interaction between ion $i$ and all the other ions:

$$\Phi^{C,i}(X_t) = \frac{1}{4\pi\epsilon_0\epsilon_w} \sum_{j=1, j\neq i}^{2N} \frac{q^{(j)}}{\|x_t^{(i)} - x_t^{(j)}\|}. \tag{4.30}$$

Ion-wall interaction potential: The ion-wall potential $\Phi^{\mathrm{IW}}$, also called the $(\sigma/r)^9$ potential, ensures that the positions $x_t^{(i)}$ of all ions $i = 1, \ldots, 2N$ lie in $R_o$. With $x_t^{(i)} = (x_t^{(i)}, y_t^{(i)}, z_t^{(i)})'$, it is modeled as

$$\Phi^{\mathrm{IW}}(x_t^{(i)}) = \frac{F_0\,(r^{(i)} + r_w)^9}{9\,\bigl(r_c + r_w - \sqrt{(x_t^{(i)})^2 + (y_t^{(i)})^2}\,\bigr)^9}, \tag{4.31}$$

where for positive ions $r^{(i)} = r^+$ (radius of a Na$^+$ atom) and for negative ions $r^{(i)} = r^-$ (radius of a Cl$^-$ atom); $r_w = 1.4$ Å is the radius of the atoms making up the wall; $r_c$ denotes the radius of the ion channel; and $F_0 = 2 \times 10^{-10}$ N, which is estimated from the ST2 water model used in molecular dynamics (Stillinger and Rahman, 1974). This ion-wall potential results in short-range forces that are significant only when the ion is close to the wall of the reservoirs R1 and R2, or anywhere in the ion channel C (since the ion channel is comparable in radius to the ions).

Short-range potential: Finally, at short range, the Coulomb interaction between two ions is modified by adding a potential $\Phi^{SR,i}(X_t)$, which replicates the effect of the overlap of electron clouds. Thus,

$$\Phi^{SR,i}(X_t) = \frac{F_0}{9} \sum_{j=1, j\neq i}^{2N} \frac{(r^{(i)} + r^{(j)})^9}{\|x_t^{(i)} - x_t^{(j)}\|^9}. \tag{4.32}$$

Similar to the ion-wall potential, $\Phi^{SR,i}$ is significant only when ion $i$ gets very close to another ion. It ensures that two oppositely charged ions attracted by the inter-ion Coulomb force (equation 4.30) cannot collide and annihilate each other. Molecular dynamics simulations show that the hydration forces between two ions add further structure to the $1/\|x_t^{(i)} - x_t^{(j)}\|^9$ repulsive potential due to the overlap of electron clouds, in the form of damped oscillations (Guàrdia et al., 1991a,b). Corry et al. (2001) incorporated the effect of the hydration forces in equation 4.32 in such a way that the maxima of the radial distribution functions for Na$^+$-Na$^+$, Na$^+$-Cl$^-$, and Cl$^-$-Cl$^-$ correspond to the values obtained experimentally.
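A minimal sketch of the pairwise potential terms of equations 4.30-4.32 (Coulomb, ion-wall, and short-range) in plain Python. The relative permittivity of water `EPS_W = 80` and the positions, charges, and radii used below are assumptions for demonstration; only `F0` and `R_W` are values quoted in the text:

```python
import math

F0 = 2e-10          # N, repulsion strength from the ST2 water model
R_W = 1.4e-10       # m, radius of the wall atoms (1.4 Angstrom)
EPS_W = 80.0        # assumed relative permittivity of water
COULOMB_K = 1.0 / (4.0 * math.pi * 8.854e-12)

def dist(a, b):
    return math.sqrt(sum((u - w) ** 2 for u, w in zip(a, b)))

def coulomb_potential(i, positions, charges):
    """Inter-ion Coulomb potential on ion i, as in equation 4.30."""
    return (COULOMB_K / EPS_W) * sum(
        charges[j] / dist(positions[i], positions[j])
        for j in range(len(positions)) if j != i)

def ion_wall_potential(x, y, r_ion, r_channel):
    """1/r^9 ion-wall repulsion of equation 4.31 (radial coordinate only)."""
    gap = r_channel + R_W - math.hypot(x, y)
    return F0 * (r_ion + R_W) ** 9 / (9.0 * gap ** 9)

def short_range_potential(i, positions, radii):
    """Electron-cloud overlap repulsion of equation 4.32."""
    return (F0 / 9.0) * sum(
        (radii[i] + radii[j]) ** 9 / dist(positions[i], positions[j]) ** 9
        for j in range(len(positions)) if j != i)
```

As the text notes, the ion-wall term grows sharply as the radial distance approaches the channel wall, while the Coulomb and short-range terms depend on the positions of all the other ions.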

4.5 Ion Permeation: Probabilistic Characterization and Brownian Dynamics (BD) Algorithm

Having given a complete description of the dynamics of the individual ions that permeate through the ion channel, in this section we give a probabilistic characterization of the ion channel current. In particular, we show that the mean ion channel current satisfies a boundary-valued partial differential equation. We then show that the Brownian dynamics (BD) simulation algorithm can be viewed as a randomized multiparticle-based algorithm for solving this boundary-valued partial differential equation to estimate the ion channel current.

4.5.1 Probabilistic Characterization of Ion Channel Current in Terms of Mean Passage Time

The aim of this subsection is to give a probabilistic characterization of the ion channel current in terms of the mean first-passage time of the diffusion process (see equation 4.27). This characterization also shows that the Brownian dynamical system 4.27 has a well-defined, unique stationary distribution. A key requirement in any mathematical construction is that the concentration of ions in each reservoir R1 and R2 remains approximately constant and equal to the physiological concentration. The following probabilistic construction ensures this.

Step 1: The $2N$ ions in the system are initialized as described above, and the ion channel C is closed. The system evolves and attains stationarity. Theorem 4.2 below shows that the probability density function of the $2N$ particles converges geometrically fast to a unique stationary distribution. Theorem 4.3 shows that in the stationary regime, all positive ions in R1 have the same stationary distribution and so are statistically indistinguishable (similarly for R2).

Step 2: After stationarity is achieved, the ion channel is opened. The ions evolve according to equation 4.27. As soon as an ion from R1 crosses the ion channel C and enters R2, the experiment is stopped. Similarly, if an ion from R2 crosses C and enters R1, the experiment is stopped. Theorem 4.4 gives partial differential equations for the mean minimum time an ion in R1 takes to cross the ion channel and reach R2, and establishes that this time is finite. From this, a theoretical expression for the mean ion channel current is constructed (eq. 4.42).


Note that if the system were allowed to evolve for an infinite time with the channel open, then eventually, due to the external applied potential, more ions would be in R2 than in R1. This would violate the condition that the concentrations of particles in R1 and R2 remain constant. In the BD simulation algorithm 4.2 presented later in this chapter, we use the above construction to restart the simulation each time an ion crosses the channel; this leads to a regenerative process that is easy to analyze. Let

$$\pi_t^{(\theta,\lambda)}(X, V) = p^{(\theta,\lambda)}\bigl(x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(2N)}, v_t^{(1)}, v_t^{(2)}, \ldots, v_t^{(2N)}\bigr)$$

denote the joint probability density function (pdf) of the positions and velocities of all $2N$ ions at time $t$. We explicitly denote the $\theta, \lambda$ dependence of the pdfs, since they depend on the PMF $U_\theta$ and the applied external potential $\lambda$. Then the joint pdf of the positions of all $2N$ ions at time $t$ is

$$\pi_t^{(\theta,\lambda)}(X) = p^{(\theta,\lambda)}\bigl(x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(2N)}\bigr) = \int_{\mathbb{R}^{6N}} \pi_t^{(\theta,\lambda)}(X, V)\,dV.$$

The following result, proved in Krishnamurthy and Chung (a), states that for the above stochastic dynamical system, $\pi_t^{(\theta,\lambda)}(X, V)$ converges exponentially fast to its stationary (invariant) distribution $\pi_\infty^{(\theta,\lambda)}(X, V)$.

Theorem 4.2: For the Brownian dynamics system (4.27, 4.28), with $\zeta = (X, V)$, there exists a unique stationary distribution $\pi_\infty^{(\theta,\lambda)}(\zeta)$, and constants $K > 0$ and $0 < \rho < 1$, such that

$$\sup_{\zeta \in \mathbb{R}^{2N} \times \mathbb{R}^{6N}} \bigl|\pi_t^{(\theta,\lambda)}(\zeta) - \pi_\infty^{(\theta,\lambda)}(\zeta)\bigr| \leq K\,\mathcal{V}(\zeta)\,\rho^t. \tag{4.33}$$

Here $\mathcal{V}(\zeta) > 1$ is an arbitrary measurable function on $\mathbb{R}^{2N} \times \mathbb{R}^{6N}$.

The above theorem on the exponential ergodicity of $\zeta_t = (X_t, V_t)$ has two consequences that we will subsequently use. First, it implies that as the system evolves, the initial coordinates $x_0^{(i)}, v_0^{(i)}$ of all the $2N$ ions are forgotten exponentially fast. This allows us to conduct BD simulations efficiently in section 4.5.2 below. Second, the exponential ergodicity also implies that a strong law of large numbers holds; this will be used below to formulate a stochastic optimization problem in terms of the stationary measure $\pi_\infty^{(\theta,\lambda)}$ for computing the potential of mean force.

Notation is as follows: for $\zeta = (\zeta^{(1)}, \ldots, \zeta^{(4N)})'$, define

$$\nabla_\zeta = \Bigl(\frac{\partial}{\partial \zeta^{(1)}}, \frac{\partial}{\partial \zeta^{(2)}}, \ldots, \frac{\partial}{\partial \zeta^{(4N)}}\Bigr)'.$$


For a vector field $f_{\theta,\lambda}(\zeta) = \bigl[f^{(1)}(\zeta)\ f^{(2)}(\zeta)\ \cdots\ f^{(4N)}(\zeta)\bigr]'$ defined on $\mathbb{R}^{4N}$, define the divergence operator

$$\mathrm{div}(f_{\theta,\lambda}) = \frac{\partial f^{(1)}}{\partial \zeta^{(1)}} + \frac{\partial f^{(2)}}{\partial \zeta^{(2)}} + \cdots + \frac{\partial f^{(4N)}}{\partial \zeta^{(4N)}}.$$

For the stochastic dynamical system 4.27, comprising $2N$ ions, define the backward elliptic operator (infinitesimal generator) $\mathcal{L}$ and its adjoint $\mathcal{L}^*$ for any test function $\phi(\zeta)$ as

$$\mathcal{L}(\phi) = \frac{1}{2}\,\mathrm{Tr}\bigl[\Sigma\,\nabla_\zeta^2 \phi(\zeta)\bigr] + \bigl(f_{\theta,\lambda}(\zeta) + A\zeta\bigr)'\,\nabla_\zeta \phi(\zeta),$$
$$\mathcal{L}^*(\phi) = \frac{1}{2}\,\mathrm{Tr}\bigl[\nabla_\zeta^2\bigl(\Sigma\,\phi(\zeta)\bigr)\bigr] - \mathrm{div}\bigl[(A\zeta + f_{\theta,\lambda}(\zeta))\,\phi(\zeta)\bigr]. \tag{4.34}$$

Here, $f_{\theta,\lambda}$ and $\Sigma$ are defined in equation 4.28. It is well known (Wong and Hajek, 1985) that the probability density function $\pi_t^{(\theta,\lambda)}(\cdot)$ of $\zeta_t = (X_t, V_t)$ satisfies the Fokker-Planck equation

$$\frac{d\pi_t^{(\theta,\lambda)}}{dt} = \mathcal{L}^*\bigl(\pi_t^{(\theta,\lambda)}\bigr). \tag{4.35}$$

Also, the stationary probability density function $\pi_\infty^{(\theta,\lambda)}(\cdot)$ satisfies

$$\mathcal{L}^*\bigl(\pi_\infty^{(\theta,\lambda)}\bigr) = 0, \qquad \int_{\mathbb{R}^{6N}}\int_{\mathbb{R}^{2N}} \pi_\infty^{(\theta,\lambda)}(X, V)\,dX\,dV = 1. \tag{4.36}$$

We next show that once stationarity has been achieved, the $N$ positive ions behave statistically identically, i.e., each ion has the same stationary marginal distribution. Define the stationary marginal density $\pi_\infty^{(\theta,\lambda)}(x^{(i)}, v^{(i)})$ of ion $i$ as

$$\pi_\infty^{(\theta,\lambda)}(x^{(i)}, v^{(i)}) = \int_{\mathbb{R}^{6N-3}}\int_{\mathbb{R}^{2N-1}} \pi_\infty^{(\theta,\lambda)}(X, V) \prod_{j=1, j\neq i}^{2N} dx^{(j)}\,dv^{(j)}. \tag{4.37}$$

The following result states that the ions are statistically indistinguishable; see the paper of Krishnamurthy and Chung (a) for the proof.

Theorem 4.3: Assuming that the ion channel C is closed, the stationary marginal densities for the positive ions in R1 are identical:

$$\pi_\infty^{(\theta,\lambda),R_1} = \pi_\infty^{(\theta,\lambda)}(x^{(1)}, v^{(1)}) = \pi_\infty^{(\theta,\lambda)}(x^{(2)}, v^{(2)}) = \cdots = \pi_\infty^{(\theta,\lambda)}(x^{(N/2)}, v^{(N/2)}).$$

Similarly, the stationary marginal densities for the positive ions in R2 are identical:

$$\pi_\infty^{(\theta,\lambda),R_2} = \pi_\infty^{(\theta,\lambda)}(x^{(N/2+1)}, v^{(N/2+1)}) = \pi_\infty^{(\theta,\lambda)}(x^{(N/2+2)}, v^{(N/2+2)}) = \cdots = \pi_\infty^{(\theta,\lambda)}(x^{(N)}, v^{(N)}). \tag{4.38}$$

Theorem 4.3 is not surprising: equations 4.23, 4.24, and 4.25 are symmetric in $i$; therefore, intuitively one would expect that once steady state has been attained, all the positive ions behave identically (and similarly the negative ions). Due to this result, in our probabilistic formulation below, once the system has attained steady state, any positive ion is representative of all the $N$ positive ions, and similarly for the negative ions.

Assume that the system 4.27, comprising $2N$ ions, has attained stationarity with the ion channel C closed. Then the ion channel is opened so that ions can diffuse into it. Let $\tau_{R_1,R_2}^{(\theta,\lambda)}$ denote the mean first-passage time for any of the $N/2$ Na$^+$ ions in R1 to travel to R2 via the gramicidin-A channel C, and $\tau_{R_2,R_1}^{(\theta,\lambda)}$ the mean first-passage time for any of the $N/2$ Na$^+$ ions in R2 to travel to R1:

$$\tau_{R_1,R_2}^{(\theta,\lambda)} = \mathrm{E}\{t_\beta\}, \quad \text{where } t_\beta = \inf\bigl\{t : \max\bigl(z_t^{(1)}, z_t^{(2)}, \ldots, z_t^{(N/2)}\bigr) \geq \beta\bigr\},$$
$$\tau_{R_2,R_1}^{(\theta,\lambda)} = \mathrm{E}\{t_\alpha\}, \quad \text{where } t_\alpha = \inf\bigl\{t : \min\bigl(z_t^{(N/2+1)}, z_t^{(N/2+2)}, \ldots, z_t^{(N)}\bigr) \leq \alpha\bigr\}. \tag{4.39}$$

Note that for ion channels such as gramicidin-A, only positive Na$^+$ ions flow through the channel to cause the channel current, so we do not need to consider the mean first-passage times of the Cl$^-$ ions. In order to give a partial differential equation for $\tau_{R_1,R_2}^{(\theta,\lambda)}$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}$, it is convenient to define the closed sets

$$P_2 = \bigl\{\zeta : \{z^{(1)} \geq \beta\} \cup \{z^{(2)} \geq \beta\} \cup \cdots \cup \{z^{(N/2)} \geq \beta\}\bigr\},$$
$$P_1 = \bigl\{\zeta : \{z^{(N/2+1)} \leq \alpha\} \cup \{z^{(N/2+2)} \leq \alpha\} \cup \cdots \cup \{z^{(N)} \leq \alpha\}\bigr\}. \tag{4.40}$$

Then it is clear that $\zeta_t \in P_2$ is equivalent to $\max\bigl(z_t^{(1)}, \ldots, z_t^{(N/2)}\bigr) \geq \beta$, since either expression implies that at least one ion has crossed from R1 to R2. Similarly, $\zeta_t \in P_1$ is equivalent to $\min\bigl(z_t^{(N/2+1)}, \ldots, z_t^{(N)}\bigr) \leq \alpha$. Thus $t_\beta$ and $t_\alpha$ defined in system 4.39 can be expressed as $t_\beta = \inf\{t : \zeta_t \in P_2\}$ and $t_\alpha = \inf\{t : \zeta_t \in P_1\}$. Hence system 4.39 is equivalent to

$$\tau_{R_1,R_2}^{(\theta,\lambda)} = \mathrm{E}\bigl\{\inf\{t : \zeta_t \in P_2\}\bigr\}, \qquad \tau_{R_2,R_1}^{(\theta,\lambda)} = \mathrm{E}\bigl\{\inf\{t : \zeta_t \in P_1\}\bigr\}. \tag{4.41}$$

In a gramicidin-A channel, $\tau_{R_2,R_1}^{(\theta,\lambda)}$ is typically much larger than $\tau_{R_1,R_2}^{(\theta,\lambda)}$. In terms of the mean first-passage times $\tau_{R_1,R_2}^{(\theta,\lambda)}$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}$ defined in equations 4.39 and 4.41, the mean current flowing from R1 via the gramicidin-A ion channel C into R2 is defined as

$$I^{(\theta,\lambda)} = q^+ \left(\frac{1}{\tau_{R_1,R_2}^{(\theta,\lambda)}} - \frac{1}{\tau_{R_2,R_1}^{(\theta,\lambda)}}\right). \tag{4.42}$$

The following result, adapted from Gihman and Skorohod (1972, p. 306), shows that $\tau_{R_1,R_2}^{(\theta,\lambda)}$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}$ satisfy a boundary-valued partial differential equation.
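Equation 4.42 is straightforward to evaluate once the two mean first-passage times are known. The passage times below are hypothetical round numbers, chosen only to illustrate that nanosecond-to-microsecond crossings yield picoampere-scale currents:

```python
Q_NA = 1.602e-19   # C, charge of a monovalent Na+ ion

def mean_current(tau_r1_r2, tau_r2_r1):
    """Mean channel current of equation 4.42:
    I = q+ * (1/tau_{R1,R2} - 1/tau_{R2,R1}), with times in seconds."""
    return Q_NA * (1.0 / tau_r1_r2 - 1.0 / tau_r2_r1)

# Hypothetical passage times: a forward crossing every 100 ns on average,
# a reverse crossing every 10 microseconds.
current = mean_current(1e-7, 1e-5)   # amperes; on the order of 1.6 pA
```

Note that equal forward and reverse passage times give zero net current, as equation 4.42 requires.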


Theorem 4.4: The mean first-passage times $\tau_{R_1,R_2}^{(\theta,\lambda)}$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}$ in equation 4.42 are obtained as

$$\tau_{R_1,R_2}^{(\theta,\lambda)} = \int_{\Xi_{R_1}} \tau_{R_1,R_2}^{(\theta,\lambda)}(\zeta)\,\pi_\infty^{(\theta,\lambda)}(\zeta)\,d\zeta, \tag{4.43}$$
$$\tau_{R_2,R_1}^{(\theta,\lambda)} = \int_{\Xi_{R_2}} \tau_{R_2,R_1}^{(\theta,\lambda)}(\zeta)\,\pi_\infty^{(\theta,\lambda)}(\zeta)\,d\zeta, \tag{4.44}$$

where

$$\tau_{R_1,R_2}^{(\theta,\lambda)}(\zeta) = \mathrm{E}\bigl\{\inf\{t : \zeta_t \in P_2\} \mid \zeta_0 = \zeta\bigr\}, \tag{4.45}$$
$$\tau_{R_2,R_1}^{(\theta,\lambda)}(\zeta) = \mathrm{E}\bigl\{\inf\{t : \zeta_t \in P_1\} \mid \zeta_0 = \zeta\bigr\}. \tag{4.46}$$

Here $\tau_{R_1,R_2}^{(\theta,\lambda)}(\zeta)$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}(\zeta)$ satisfy the boundary-valued partial differential equations

$$\mathcal{L}\bigl(\tau_{R_1,R_2}^{(\theta,\lambda)}(\zeta)\bigr) = -1 \ \text{ for } \zeta \notin P_2, \qquad \tau_{R_1,R_2}^{(\theta,\lambda)}(\zeta) = 0 \ \text{ for } \zeta \in P_2,$$
$$\mathcal{L}\bigl(\tau_{R_2,R_1}^{(\theta,\lambda)}(\zeta)\bigr) = -1 \ \text{ for } \zeta \notin P_1, \qquad \tau_{R_2,R_1}^{(\theta,\lambda)}(\zeta) = 0 \ \text{ for } \zeta \in P_1, \tag{4.47}$$

where $\mathcal{L}$ denotes the backward operator defined in equation 4.34. Furthermore, $\tau_{R_1,R_2}^{(\theta,\lambda)}$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}$ are finite.

The proof of equations 4.47 follows directly from corollary 1, p. 306, of Gihman and Skorohod (1972), which shows that the mean first-passage time from any point $\zeta$ to a closed set $P_2$ satisfies equations 4.47. The proof that $\tau_{R_1,R_2}^{(\theta,\lambda)}$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}$ are finite follows directly from p. 145 of Friedman (1975).

Remark: Equation 4.42 specifies the mean current as the charge per mean time it takes for an ion to cross the ion channel. Instead of equation 4.42, an alternative definition of the mean current is the expected rate of charge flow across the ion channel, i.e.,

$$\tilde{I}(\theta,\lambda) = q^+ \bigl(\mu_{R_1,R_2}^{(\theta,\lambda)} - \mu_{R_2,R_1}^{(\theta,\lambda)}\bigr), \tag{4.48}$$

where, with $t_\alpha$ and $t_\beta$ defined in equation 4.39, the mean rates $\mu_{R_1,R_2}^{(\theta,\lambda)}$ and $\mu_{R_2,R_1}^{(\theta,\lambda)}$ are defined as

$$\mu_{R_1,R_2}^{(\theta,\lambda)} = \mathrm{E}\Bigl\{\frac{1}{t_\beta}\Bigr\}, \qquad \mu_{R_2,R_1}^{(\theta,\lambda)} = \mathrm{E}\Bigl\{\frac{1}{t_\alpha}\Bigr\}. \tag{4.49}$$

It is important to note that the two definitions of current, namely $I^{(\theta,\lambda)}$ in 4.42 and $\tilde{I}(\theta,\lambda)$ in 4.48, are not equivalent, since $\mathrm{E}\{1/t_\beta\} \neq 1/\mathrm{E}\{t_\beta\}$. Similar to the proof of theorem 4.4, partial differential equations can be obtained for $\mu_{R_1,R_2}^{(\theta,\lambda)}$ and $\mu_{R_2,R_1}^{(\theta,\lambda)}$; however, the resulting boundary conditions are much more complex than equations 4.47.
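The fact that $\mathrm{E}\{1/t_\beta\} \neq 1/\mathrm{E}\{t_\beta\}$ is just Jensen's inequality for the convex map $t \mapsto 1/t$, and it is easy to check numerically. The shifted-exponential crossing-time model below is an arbitrary stand-in, not the chapter's first-passage distribution:

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical crossing times: shifted-exponential, bounded away from zero
times = [0.1 + rng.expovariate(1.0) for _ in range(20000)]

inverse_of_mean = 1.0 / statistics.mean(times)             # 1 / E{t}
mean_of_inverse = statistics.mean(1.0 / t for t in times)  # E{1/t}
# Jensen's inequality for the convex map t -> 1/t gives E{1/t} >= 1/E{t},
# so the two current definitions generally differ.
```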


4.5.2 Brownian Dynamics Simulation for Estimation of Ion Channel Current

It is not possible to obtain explicit closed-form expressions for the mean first-passage times $\tau_{R_1,R_2}^{(\theta,\lambda)}$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}$, and hence for the current $I^{(\theta,\lambda)}$ in equation 4.42. The aim of BD simulation is to obtain estimates of these quantities by directly simulating the stochastic dynamical system 4.27. In this subsection we show that the current estimates $\hat{I}^{(\theta,\lambda)}(L)$ (defined below) obtained from an $L$-iteration BD simulation are statistically consistent, i.e., $\lim_{L\to\infty} \hat{I}^{(\theta,\lambda)}(L) = I^{(\theta,\lambda)}$ almost surely.

Due to the applied external potential $\Phi_\lambda^{\mathrm{ext}}$ (see equation 4.29), ions drift from reservoir R1 via the ion channel C to reservoir R2, thus generating an ion channel current. In order to construct an estimate of the current flowing from R1 to R2 in the BD simulation, we need to count the number of upcrossings (the number of times ions cross from R1 to R2 through the region C) and downcrossings (the number of times ions cross from R2 to R1 through C). Recall from fig. 4.3 that $z = \alpha = -12.5$ Å denotes the boundary between R1 and C, and $z = \beta = 12.5$ Å denotes the boundary between R2 and C.

Time Discretization of Ion Dynamics: To implement the BD simulation algorithm described below on a digital computer, it is necessary to discretize the continuous-time dynamics (see eq. 4.27) of the $2N$ ions. The BD simulation algorithm typically uses a sampling interval of $\Delta = 10^{-15}$ seconds (1 femtosecond) for this time discretization, and propagates the $2N$ ions over a total time period of $T = 10^{-4}$ seconds. The time discretization proceeds as follows. Consider a regular partition $0 = t_0 < t_1 < \cdots < t_{k-1} < t_k < \cdots < T$ with discretization interval $\Delta = t_k - t_{k-1} = 10^{-15}$ seconds. There are several possible methods for time discretization of the stochastic differential equation 4.27; see Kloeden and Platen (1992) for a detailed exposition. Here we briefly present a zero-order hold and a first-order hold approximation. The first-order hold approximation was derived by van Gunsteren et al. (1981). It is well known (Wong and Hajek, 1985) that over the time interval $[t_k, t_{k+1})$, the solution of equation 4.27 satisfies

$$\zeta_{t_{k+1}} = e^{A\Delta}\zeta_{t_k} + \int_{t_k}^{t_{k+1}} e^{A(t_{k+1}-\tau)} f_{\theta,\lambda}(\zeta_\tau)\,d\tau + \int_{t_k}^{t_{k+1}} e^{A(t_{k+1}-\tau)} \Sigma^{1/2}\,dw_\tau. \tag{4.50}$$

In the zero-order hold approximation, $f_{\theta,\lambda}(\zeta_\tau)$ is assumed to be approximately constant over the short interval $[t_k, t_{k+1})$ and is set to the constant $f_{\theta,\lambda}(\zeta_{t_k})$ in equation 4.50. This yields

$$\zeta_{t_{k+1}} = e^{A\Delta}\zeta_{t_k} + \int_{t_k}^{t_{k+1}} e^{A(t_{k+1}-\tau)} f_{\theta,\lambda}(\zeta_{t_k})\,d\tau + \int_{t_k}^{t_{k+1}} e^{A(t_{k+1}-\tau)} \Sigma^{1/2}\,dw_\tau. \tag{4.51}$$

In the first-order hold, the following approximation is used in equation 4.50:

$$f_{\theta,\lambda}(\zeta_\tau) \approx f_{\theta,\lambda}(\zeta_{t_k}) + (\tau - t_k)\,\frac{\partial f_{\theta,\lambda}(\zeta_t)}{\partial t}.$$


In van Gunsteren et al. (1981), the derivative above is approximated by

$$\frac{\partial f_{\theta,\lambda}(\zeta_t)}{\partial t} \approx \frac{f_{\theta,\lambda}(\zeta_{t_k}) - f_{\theta,\lambda}(\zeta_{t_{k-1}})}{\Delta}.$$

Thus the first-order hold approximation of van Gunsteren et al. (1981) yields

$$\zeta_{t_{k+1}} = e^{A\Delta}\zeta_{t_k} + \int_{t_k}^{t_{k+1}} e^{A(t_{k+1}-\tau)} f_{\theta,\lambda}(\zeta_{t_k})\,d\tau + \int_{t_k}^{t_{k+1}} e^{A(t_{k+1}-\tau)} (\tau - t_k)\,\frac{f_{\theta,\lambda}(\zeta_{t_k}) - f_{\theta,\lambda}(\zeta_{t_{k-1}})}{\Delta}\,d\tau + \int_{t_k}^{t_{k+1}} e^{A(t_{k+1}-\tau)} \Sigma^{1/2}\,dw_\tau. \tag{4.52}$$

Let $k = 0, 1, \ldots$ denote discrete time, where $k$ corresponds to time $t_k$. Note that the last integral above is merely a discrete-time Gauss-Markov process, which we will denote as $w_k^{(d)}$. Moreover, since the first block element of $w$ in (4.26) is 0,

$$w_k^{(d)} = \begin{bmatrix} 0_{6N\times 1} \\ \bar{w}_k^{(d)} \end{bmatrix}, \tag{4.53}$$

where the $6N$-dimensional vector $\bar{w}_k^{(d)}$ denotes the nonzero components of $w_k^{(d)}$.

We now elaborate on the zero-order hold model. Due to the simple structure of $A$ in equation 4.28, the matrix exponentials

$$\Gamma = e^{A\Delta}, \qquad B = \int_{t_k}^{t_{k+1}} e^{A(t_{k+1}-\tau)}\,d\tau \tag{4.54}$$

in equation 4.50 can be explicitly computed as

$$\Gamma = \begin{bmatrix} I_{6N\times 6N} & L \\ 0_{6N\times 6N} & e^{D\Delta} \end{bmatrix}, \qquad B = \begin{bmatrix} \Delta I_{6N\times 6N} & D^{-1}(L - \Delta I) \\ 0_{6N\times 6N} & L \end{bmatrix}, \tag{4.55}$$

where

$$D = -\begin{bmatrix} \gamma^+ I_{3N\times 3N} & 0_{3N\times 3N} \\ 0_{3N\times 3N} & \gamma^- I_{3N\times 3N} \end{bmatrix}, \qquad L = D^{-1}\bigl(e^{D\Delta} - I\bigr).$$
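For a single 1-D ion, the block matrices of equations 4.54 and 4.55 collapse to scalars, so the zero-order-hold recursion (equations 4.57 and 4.58 below) can be sketched directly; the noise term is omitted here for clarity:

```python
import math

def zoh_step(x, v, force, gamma, dt):
    """One zero-order-hold step of equations 4.57-4.58 for a single 1-D
    ion, with D = -gamma a scalar and the noise term omitted:
        L  = D^{-1} (exp(D*dt) - 1)
        x' = x + L*v + D^{-1} (L - dt) * F
        v' = exp(D*dt) * v + L * F
    """
    D = -gamma
    L = (math.exp(D * dt) - 1.0) / D
    x_new = x + L * v + (L - dt) / D * force
    v_new = math.exp(D * dt) * v + L * force
    return x_new, v_new
```

With zero force, the position advances by exactly $L v$ and the velocity decays by $e^{D\Delta}$, matching the block structure of $\Gamma$ and $B$.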

(d)

(d)

(d)

ζk+1 = Γζk + B fθ,λ (ζk ) + wk .

(4.56)

Expanding this out in terms of Xk and Vk , we have the following discrete time dynamics for the positions and velocities of the 2N ions:

(d)

Xk+1 = Xk + LVk + D−1 (L − ΔI)Fθ,λ (Xk )

(4.57)

Vk+1 = eDΔ Vk + LFθ,λ (Xk ) +

(4.58)

(d) w ¯k ,

where w ¯ k is a 6N dimensional discrete-time Gauss Markov process. Brownian Dynamics Simulation Algorithm In the BD simulation Algorithm 4.2 below, we use the following notation:


The algorithm runs for $L$ iterations, where $L$ is user-specified. Each iteration $l = 1, 2, \ldots, L$ runs for a random number of discrete-time steps until an ion crosses the channel. We denote these random times as $\hat{\tau}_{R_1,R_2}^{(l)}$ if the ion crossed from R1 to R2, and $\hat{\tau}_{R_2,R_1}^{(l)}$ if the ion crossed from R2 to R1. Thus

$$\hat{\tau}_{R_1,R_2}^{(l)} = \min\{k : \zeta_k^{(d)} \in P_2\}, \qquad \hat{\tau}_{R_2,R_1}^{(l)} = \min\{k : \zeta_k^{(d)} \in P_1\}.$$

The positive ions $\{1, 2, \ldots, N/2\}$ are in R1 at steady state $\pi_\infty^{(\theta,\lambda)}$, and the positive ions $\{N/2+1, \ldots, N\}$ are in R2 at steady state. $L_{R_1,R_2}$ is a counter for how many Na$^+$ ions have crossed from R1 to R2, and $L_{R_2,R_1}$ counts how many Na$^+$ ions have crossed from R2 to R1. Note that $L_{R_1,R_2} + L_{R_2,R_1} = L$. We only consider the passage of positive Na$^+$ ions $i = 1, \ldots, N$ across the ion channel, since in a gramicidin-A channel the ion channel current is caused only by Na$^+$ ions.

Algorithm 4.2: Brownian Dynamics Simulation Algorithm (for Fixed $\theta$ and $\lambda$)

Input parameters $\theta$ for PMF and $\lambda$ for applied external potential. For $l = 1$ to $L$ iterations:

Step 1: Initialize all $2N$ ions according to the stationary distribution $\pi_\infty^{(\theta,\lambda)}$ defined in equation 4.36. Open the ion channel at discrete time $k = 0$ and set $k = 1$.

Step 2: Propagate all $2N$ ions according to the time-discretized Brownian dynamical system 4.56 until the time $k^*$ at which an ion crosses the channel.

(a) If an ion crossed the ion channel from R1 to R2, i.e., for some ion $i^* \in \{1, 2, \ldots, N/2\}$, $z_{k^*}^{(i^*)} \geq \beta$, then set $\hat{\tau}_{R_1,R_2}^{(l)} = k^*$. Update the number of crossings from R1 to R2: $L_{R_1,R_2} = L_{R_1,R_2} + 1$.

(b) If an ion crossed the ion channel from R2 to R1, i.e., for some ion $i^* \in \{N/2+1, \ldots, N\}$, $z_{k^*}^{(i^*)} \leq \alpha$, then set $\hat{\tau}_{R_2,R_1}^{(l)} = k^*$. Update the number of crossings from R2 to R1: $L_{R_2,R_1} = L_{R_2,R_1} + 1$.

Step 3: End for loop. Compute the mean first-passage time and mean current estimates after $L$ iterations as

$$\hat{\tau}_{R_1,R_2}^{(\theta,\lambda)}(L) = \frac{1}{L_{R_1,R_2}} \sum_{l=1}^{L_{R_1,R_2}} \hat{\tau}_{R_1,R_2}^{(l)}, \qquad \hat{\tau}_{R_2,R_1}^{(\theta,\lambda)}(L) = \frac{1}{L_{R_2,R_1}} \sum_{l=1}^{L_{R_2,R_1}} \hat{\tau}_{R_2,R_1}^{(l)}, \tag{4.59}$$

$$\hat{I}^{(\theta,\lambda)}(L) = q^+ \left(\frac{1}{\hat{\tau}_{R_1,R_2}^{(\theta,\lambda)}(L)} - \frac{1}{\hat{\tau}_{R_2,R_1}^{(\theta,\lambda)}(L)}\right). \tag{4.60}$$
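The mechanics of Algorithm 4.2 can be mimicked with a toy model in which a single 1-D random walk on the channel axis stands in for the full $2N$-ion Brownian dynamics; the drift, boundaries, and unit charge below are illustrative placeholders:

```python
import random

def bd_current_estimate(L_iters, drift=0.05, alpha=-12.5, beta=12.5,
                        q=1.0, seed=1):
    """Toy analogue of Algorithm 4.2: a single 1-D random walk on the
    channel axis replaces the 2N-ion dynamics.  Each iteration restarts
    at the channel mouth and runs until the walk exits at beta (an
    R1->R2 crossing) or alpha (an R2->R1 crossing); the current estimate
    follows equations 4.59-4.60."""
    rng = random.Random(seed)
    t12, t21 = [], []
    for _ in range(L_iters):
        z, k = 0.0, 0
        while alpha < z < beta:
            z += drift + rng.gauss(0.0, 1.0)
            k += 1
        (t12 if z >= beta else t21).append(k)
    tau12 = sum(t12) / len(t12) if t12 else float("inf")
    tau21 = sum(t21) / len(t21) if t21 else float("inf")
    return q * (1.0 / tau12 - 1.0 / tau21), len(t12), len(t21)
```

As in the algorithm, every iteration ends with exactly one crossing, so the two crossing counters always sum to the number of iterations, and a positive drift makes upcrossings more frequent than downcrossings.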

The following result shows that the estimated current Iˆ(θ,λ) (L) obtained from a BD simulation run over L iterations is strongly consistent.


Theorem 4.5: For fixed PMF $\theta \in \Theta$ and applied external potential $\lambda \in \Lambda$, the ion channel current estimate $\hat{I}^{(\theta,\lambda)}(L)$ obtained from the BD simulation algorithm 4.2 over $L$ iterations is strongly consistent, i.e.,

$$\lim_{L\to\infty} \hat{I}^{(\theta,\lambda)}(L) = I^{(\theta,\lambda)} \quad \text{w.p.1}, \tag{4.61}$$

where $I^{(\theta,\lambda)}$ is the mean current defined in equation 4.42.

Proof: Since by construction each of the $L$ iterations in algorithm 4.2 is statistically independent, and $\mathrm{E}\{\hat{\tau}_{R_1,R_2}^{(l)}\}$ and $\mathrm{E}\{\hat{\tau}_{R_2,R_1}^{(l)}\}$ are finite (see theorem 4.4), by Kolmogorov's strong law of large numbers

$$\lim_{L\to\infty} \hat{\tau}_{R_1,R_2}^{(\theta,\lambda)}(L) = \tau_{R_1,R_2}^{(\theta,\lambda)}, \qquad \lim_{L\to\infty} \hat{\tau}_{R_2,R_1}^{(\theta,\lambda)}(L) = \tau_{R_2,R_1}^{(\theta,\lambda)} \quad \text{w.p.1}.$$

Thus $q^+\Bigl(1/\hat{\tau}_{R_1,R_2}^{(\theta,\lambda)}(L) - 1/\hat{\tau}_{R_2,R_1}^{(\theta,\lambda)}(L)\Bigr) \to I^{(\theta,\lambda)}$ w.p.1 as $L \to \infty$.

Remark: If, instead of equation 4.42, the mean rate definition in equation 4.48 is used for the mean current $\tilde{I}(\theta,\lambda)$, then the following minor modification of algorithm 4.2 yields consistent estimates of $\tilde{I}(\theta,\lambda)$. Instead of equations 4.59 and 4.60, use

$$\hat{\mu}_{R_1,R_2}^{(\theta,\lambda)}(L) = \frac{1}{L_{R_1,R_2}} \sum_{l=1}^{L_{R_1,R_2}} \frac{1}{\hat{\tau}_{R_1,R_2}^{(l)}}, \qquad \hat{\mu}_{R_2,R_1}^{(\theta,\lambda)}(L) = \frac{1}{L_{R_2,R_1}} \sum_{l=1}^{L_{R_2,R_1}} \frac{1}{\hat{\tau}_{R_2,R_1}^{(l)}}, \tag{4.62}$$

$$\hat{I}^{(\theta,\lambda)}(L) = q^+ \bigl(\hat{\mu}_{R_1,R_2}^{(\theta,\lambda)}(L) - \hat{\mu}_{R_2,R_1}^{(\theta,\lambda)}(L)\bigr). \tag{4.63}$$

Then a virtually identical proof to theorem 4.5 yields that $\hat{I}^{(\theta,\lambda)}(L) \to \tilde{I}(\theta,\lambda)$ w.p.1 as $L \to \infty$.

Implementation Details and Variations of Algorithm 4.2: In algorithm 4.2, the procedure of resetting all ions to $\pi_\infty^{(\theta,\lambda)}$ in step 1 when any ion crosses the channel can be expressed mathematically as

$$\zeta_{k+1}^{(d)} = 1_{\{\zeta_k^{(d)} \notin P_2 \cup P_1\}}\bigl[\Gamma\zeta_k^{(d)} + B\,f_{\theta,\lambda}(\zeta_k^{(d)}) + w_k^{(d)}\bigr] + 1_{\{\zeta_k^{(d)} \in P_2 \cup P_1\}}\,\zeta_0^{(d)}, \qquad \zeta_0^{(d)} \sim \pi_\infty^{(\theta,\lambda)}, \tag{4.64}$$

where $P_1, P_2$ are defined in equation 4.40. The following approximations of algorithm 4.2 can be used in actual numerical simulations.

Instead of steps 2a and 2b, only remove the crossed ion, denoted $i^*$, and put it back in its reservoir with distribution $\pi_\infty^{(\theta,\lambda),R_1}$ or $\pi_\infty^{(\theta,\lambda),R_2}$ (eq. 4.38), depending on whether it originated from R1 or R2. The other particles are not reset. With $\zeta_k^{(d),i^*}$ denoting the position and velocity of the crossed ion and $\bar{\zeta}_k^{(d)}$ denoting the positions and velocities of the remaining $2N-1$ ions, this is mathematically equivalent to replacing equation 4.64 by

$$\zeta_{k+1}^{(d)} = 1_{\{\zeta_k^{(d)} \notin P_2 \cup P_1\}}\bigl[\Gamma\zeta_k^{(d)} + B\,f_{\theta,\lambda}(\zeta_k^{(d)}) + w_k^{(d)}\bigr] + 1_{\{\zeta_k^{(d)} \in P_2 \cup P_1\}}\bigl[\Gamma\zeta_k^{(d)} + B\,f_{\theta,\lambda}\bigl(\zeta_k^{(d),i^*}, \bar{\zeta}_k^{(d)}\bigr) + w_k^{(d)}\bigr], \tag{4.65}$$

where $\zeta_k^{(d),i^*} \sim \pi_\infty^{(\theta,\lambda),R_1}$ if $i^* \in \{1, \ldots, N/2\}$, and $\zeta_k^{(d),i^*} \sim \pi_\infty^{(\theta,\lambda),R_2}$ if $i^* \in \{N/2+1, \ldots, N\}$.

Alternatively, proceed as in the above approximation (eq. 4.65), except that $\zeta_k^{(d),i^*}$ is replaced according to a uniform distribution.

The above approximations are justified for three reasons:

1. Only one ion can be inside the gramicidin-A channel C at any time instant. When this happens, the ion channel behaves as though it is closed, and the probabilistic construction of step 1 in section 4.5.1 applies.

2. The probability density functions of the remaining $2N-1$ ions converge rapidly to their stationary distribution and forget their initial distribution exponentially fast, due to the exponential ergodicity theorem 4.2. In comparison, the time taken for an ion to cross the channel is significantly larger. As a result, the removal of crossed particles and their replacement in the reservoir happens extremely infrequently, and between such events the probability density functions of the ions rapidly converge to their stationary distribution.

3. If an ion enters the channel C, then the change in the concentration of ions in the reservoir is of magnitude $1/N$, which is negligible if $N$ is chosen sufficiently large.

4.6 Adaptive Brownian Dynamics Mesoscopic Simulation of Ion Channel

Having given a complete description of the dynamics of individual ions in section 4.4 and the Brownian dynamics algorithm for estimating the ion channel current, in this section we describe how the Brownian dynamics algorithm can be adaptively controlled to determine the molecular structure of the ion channel. We estimate the PMF profile $U_\theta$, parameterized by $\theta$, by computing the $\theta$ that optimizes the fit between the mean current $I^{(\theta,\lambda)}$ (defined above in eq. 4.42) and the experimentally observed current $y(\lambda)$ defined below. Unfortunately, it is impossible to compute $I^{(\theta,\lambda)}$ explicitly from equation 4.42. For this reason we resort to the stochastic optimization problem formulation below, where consistent estimates of $I^{(\theta,\lambda)}$ are obtained via the Brownian dynamics simulation algorithm 4.2. The main algorithm presented in this section is the adaptive Brownian dynamics simulation algorithm (algorithm 4.3), which solves the stochastic optimization problem and yields the optimal PMF. We have also shown how the effective surface charge density along the inside surface of the channel protein can be determined from the PMF (Krishnamurthy and Chung, b).


4.6.1 Formulation of PMF Estimation as Stochastic Optimization Problem

The stochastic optimization problem formulation for determining the optimal PMF estimate comprises the following four ingredients.

Experimentally Observed Ion Channel Current $y(\lambda)$: Neurobiologists use the patch-clamp experimental setup to obtain experimental measurements of the current flowing through a single ion channel. Typically, the measured discrete-time (digitized) current from a patch-clamp experiment is obtained by sampling the continuous-time observed current at 10 kHz (i.e., at 0.1 millisecond intervals). Note that this is at a much slower timescale than the dynamics of the individual ions, which move around at a femtosecond timescale. Patch clamping was widely regarded as a breakthrough in the 1970s for understanding the dynamics of ion channels at a millisecond timescale. From patch-clamp experimental data, neurobiologists can obtain an accurate measurement of the actual current $y(\lambda)$ flowing through a gramicidin-A ion channel for various external applied potentials $\lambda \in \Lambda$. For example, as shown in Chung et al. (1991), the resulting discrete time series can be modeled as a hidden Markov model (HMM). Then, by using an HMM maximum likelihood estimator (Chung et al., 1991; James et al., 1996), accurate estimates of the open current level $y(\lambda)$ of the ion channel can be computed. Neurobiologists typically plot the relationship between the experimentally determined current $y(\lambda)$ and the applied voltage $\lambda$ as an IV curve; such curves provide a unique signature for an ion channel. For our purposes, $y(\lambda)$ denotes the true (real-world) channel current.

Loss Function: Let $n = 1, 2, \ldots$ denote the batch number. For a fixed applied field $\lambda \in \Lambda$, consider running the BD simulation algorithm 4.2 at batch $n$, resulting in the simulated current $I_n^{(\theta,\lambda)}$. Define the mean square error loss function as

$$Q(\theta, \lambda) = \mathrm{E}\bigl\{|I_n^{(\theta,\lambda)} - y(\lambda)|^2\bigr\}, \tag{4.66}$$

where $Q_n(\theta, \lambda) = |I_n^{(\theta,\lambda)} - y(\lambda)|^2$. Define the total loss function obtained by adding the mean square error over all the applied fields $\lambda \in \Lambda$ on the IV curve:

$$Q(\theta) = \sum_{\lambda\in\Lambda} Q(\theta, \lambda). \tag{4.67}$$

The optimal PMF $U_{\theta^*}$ is determined by the parameter $\theta^*$ that best fits the mean current $I^{(\theta,\lambda)}$ to the experimentally determined IV curve of a gramicidin-A channel, i.e.,

$$\theta^* = \arg\min_{\theta\in\Theta} Q(\theta). \tag{4.68}$$
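The loss of equations 4.66-4.68 can be sketched with a stand-in for the BD simulator. The linear channel model `model` and the two-point IV data below are entirely hypothetical; in the chapter, `simulate_current` would be one run of Algorithm 4.2:

```python
def total_loss(theta, simulate_current, iv_data):
    """Total loss Q(theta) of equations 4.66-4.67: squared error between
    simulated and measured current, summed over the applied potentials
    of the IV curve.  `simulate_current(theta, lam)` stands in for one
    BD run (Algorithm 4.2); `iv_data` maps lambda -> y(lambda)."""
    return sum((simulate_current(theta, lam) - y) ** 2
               for lam, y in iv_data.items())

# Hypothetical linear channel model and a two-point IV curve
iv_data = {100e-3: 2.0e-12, 200e-3: 4.0e-12}     # volts -> amperes
model = lambda theta, lam: theta * lam
best = min((total_loss(th, model, iv_data), th)
           for th in [1e-11, 2e-11, 3e-11])[1]    # crude grid search
```

The crude grid search here is only for illustration; the chapter's algorithm 4.3 replaces it with a stochastic gradient descent on $\theta$.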

Figure 4.5 Adaptive Brownian dynamics simulation for estimating the PMF. (Block diagram: the Brownian dynamics simulation produces the current estimate $\hat{I}_n^{(\theta,\lambda)}$, which feeds the loss function evaluation $Q_n(\theta, \lambda)$; the gradient estimator produces $\hat{\nabla}_\theta Q_n(\theta, \lambda)$, which the stochastic gradient algorithm uses to update $\theta_n$.)

Let $\Theta^*$ denote the set of local minima, whose elements $\theta^*$ satisfy the second-order sufficient conditions for a local minimum:

$$\nabla_\theta Q(\theta) = 0, \qquad \nabla_\theta^2 Q(\theta) > 0, \tag{4.69}$$

where the notation $\nabla_\theta^2 Q(\theta) > 0$ means that the Hessian is a positive definite matrix. However, the deterministic optimization (eqs. 4.66, 4.68) cannot be carried out directly, since it is not possible to obtain an explicit closed-form expression for 4.66: the partial differential equation 4.47 for the mean first-passage times $\tau_{R_1,R_2}^{(\theta,\lambda)}$ and $\tau_{R_2,R_1}^{(\theta,\lambda)}$ cannot be solved explicitly. This motivates us to formulate the estimation of the PMF as a stochastic optimization problem.

4.6.2 Stochastic Gradient Algorithms for Estimating the Potential of Mean Force (PMF) and the Need for Gradient Estimation

We now give a complete description of the adaptive Brownian dynamics simulation algorithm for computing the optimal PMF estimate $U_{\theta^*}$. The algorithm is schematically depicted in fig. 4.5. Recall that $n = 0, 1, \ldots$ denotes the batch number.

Algorithm 4.3: Adaptive Brownian Dynamics Simulation Algorithm for Estimating PMF

Step 0: Set the batch index $n = 0$ and initialize $\theta_0 \in \Theta$.

Step 1 (Evaluation of loss function): At batch $n$, evaluate the loss function $Q_n(\theta_n, \lambda)$ for each external potential $\lambda \in \Lambda$ according to equation 4.66. This uses one independent BD simulation (algorithm 4.2) for each $\lambda$.

Step 2 (Gradient estimation): Compute the gradient estimate $\hat{\nabla}_\theta Q_n(\theta_n, \lambda)$, either as a finite difference (see eq. 4.72 below) or according to the SPSA algorithm (eq. 4.73) below.


Sensor Adaptive Signal Processing of Biological Nanotubes

Step 3 Stochastic approximation algorithm: Update the PMF estimate:

\theta_{n+1} = \theta_n - \epsilon_{n+1} \sum_{\lambda \in \Lambda} \widehat{\nabla}_\theta Q_n(\theta_n, \lambda),  (4.70)

where \epsilon_n denotes a decreasing step size (see the discussion below for the choice of step size).

Step 4: Set n to n + 1 and go to step 1.

A crucial aspect of the above algorithm is the gradient estimation step 2. In this step, an estimate \widehat{\nabla}_\theta Q_n(\theta, \lambda) of the gradient \nabla_\theta Q_n(\theta, \lambda) is computed. This gradient estimate is then fed to the stochastic gradient algorithm (step 3), which updates the PMF. Note that since the explicit dependence of Q_n(\theta, \lambda) on \theta is not known, it is not possible to compute \nabla_\theta Q_n(\theta, \lambda). Thus we have to resort to gradient estimation, e.g., the finite difference estimators described below or a more sophisticated algorithm such as IPA (infinitesimal perturbation analysis). The step size \epsilon_n is typically chosen as

\epsilon_n = \epsilon / (n + 1 + R)^\kappa,  (4.71)

where 0.5 < \kappa \le 1 and R is some positive constant. Note that this choice of step size automatically satisfies the condition \sum_{n=1}^{\infty} \epsilon_n = \infty, which is required for convergence of algorithm 4.3.

Kiefer-Wolfowitz Finite Difference Gradient Estimator

An obvious gradient estimator is obtained by finite differences as follows. Suppose \theta is a p-dimensional vector. Let e_1, e_2, \ldots, e_p denote the p-dimensional unit vectors, where e_i is a unit vector with 1 in the ith position and zeros elsewhere. Then the two-sided finite difference gradient estimator is

\widehat{\nabla}_\theta Q_n(\theta, \lambda) = \left[ \frac{Q_n(\theta_n + \mu_n e_i, \lambda) - Q_n(\theta_n - \mu_n e_i, \lambda)}{2\mu_n} \right]_{i=1,\ldots,p}.  (4.72)

Using equation 4.72 in algorithm 4.3 yields the so-called finite difference stochastic gradient algorithm. In the above gradient estimator, \mu_k = \mu/(k+1)^\gamma, where typically \gamma < \kappa (with \kappa defined in eq. 4.71), e.g., \gamma = 0.101 and \kappa = 0.602. The main disadvantages of the above finite difference gradient estimator are twofold. First, the bias of the gradient estimate is O(\mu_n^2), i.e.,

E\big[\widehat{\nabla}_\theta Q_n(\theta, \lambda) \mid \theta_1, \ldots, \theta_n\big] - \nabla_\theta Q_n(\theta, \lambda) = O(\mu_n^2).

Second, the simulation cost of implementing the above estimator is large: it requires 2p BD simulations, since one BD simulation is required to evaluate Q_n(\theta_n + \mu_n e_i, \lambda) and one to evaluate Q_n(\theta_n - \mu_n e_i, \lambda) for each i = 1, 2, \ldots, p.
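The two-sided estimator of eq. 4.72, plugged into the recursion of eqs. 4.70 and 4.71, can be sketched in a few lines. In the chapter each loss evaluation is one Brownian dynamics run; here, purely for illustration, the loss is replaced by a synthetic noisy quadratic with a known minimizer `theta_star` (an assumption, not part of the chapter's model):

```python
import numpy as np

def fd_gradient(Q, theta, mu):
    """Two-sided finite-difference gradient estimate (eq. 4.72):
    2p evaluations of the noisy loss Q for a p-dimensional theta."""
    p = len(theta)
    g = np.zeros(p)
    for i in range(p):
        e = np.zeros(p)
        e[i] = 1.0
        g[i] = (Q(theta + mu * e) - Q(theta - mu * e)) / (2.0 * mu)
    return g

# Synthetic stand-in for the BD-evaluated loss: a noisy quadratic with
# known minimizer theta_star (illustrative only; in the chapter each
# evaluation is one Brownian dynamics batch, algorithm 4.2).
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 3.0])

def Q(theta):
    return np.sum((theta - theta_star) ** 2) + 0.01 * rng.standard_normal()

theta = np.zeros(3)
for n in range(2000):
    eps_n = 0.5 / (n + 1 + 10) ** 0.602  # step size, eq. 4.71 (kappa = 0.602)
    mu_n = 0.2 / (n + 1) ** 0.101        # perturbation size (gamma = 0.101)
    theta = theta - eps_n * fd_gradient(Q, theta, mu_n)

print(theta)  # settles near theta_star
```

Note the 2p loss evaluations per iteration inside `fd_gradient`, which is exactly the cost the text identifies as the estimator's main drawback.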


Simultaneous Perturbation Stochastic Approximation (SPSA) Algorithm

Unlike the Kiefer-Wolfowitz algorithm, the SPSA algorithm (Spall, 2003) picks a single random direction d_n along which the derivative is evaluated at each batch n. Thus the main advantage of SPSA compared to finite differences is that evaluating the gradient estimate \widehat{\nabla}_\theta Q_n(\theta, \lambda) in SPSA requires only two BD simulations, i.e., the number of evaluations is independent of the dimension p of the parameter vector \theta. We refer the reader to Spall (2003) and to the Web site www.jhuapl.edu/SPSA/ for details, variations, and applications of the SPSA algorithm. The SPSA algorithm proceeds as follows. Generate the p-dimensional vector d_n with random elements d_n(i), i = 1, \ldots, p, simulated as

d_n(i) = \begin{cases} -1 & \text{with probability } 0.5 \\ +1 & \text{with probability } 0.5. \end{cases}

Then the SPSA algorithm uses the following gradient estimator together with the stochastic gradient algorithm (eq. 4.70):

\widehat{\nabla}_\theta Q_n(\theta, \lambda) = \left[ \frac{Q_n(\theta_n + \mu_n d_n, \lambda) - Q_n(\theta_n - \mu_n d_n, \lambda)}{2\mu_n d_n(i)} \right]_{i=1,\ldots,p}.  (4.73)

Here \mu_k is chosen by a process similar to that of the Kiefer-Wolfowitz algorithm. Despite the substantial computational efficiency of SPSA compared to Kiefer-Wolfowitz, the asymptotic efficiency of SPSA is identical to that of the Kiefer-Wolfowitz algorithm. Thus SPSA can be viewed as a novel application of randomization in gradient estimation to break the curse of dimensionality. It can be proved that, like the finite difference scheme, SPSA also has a bias of O(\mu_n^2) (Spall, 2003).

Remarks: In the SPSA algorithm above, the elements of d_n were chosen according to a Bernoulli distribution.
In general, it is possible to generate the elements of d_n according to other distributions, as long as these distributions are symmetric, zero mean, and have bounded inverse moments; see Spall (2003) for a complete exposition of SPSA.

Convergence of Adaptive Brownian Dynamics Simulation Algorithm 4.3

Here we show that the estimates \theta_n generated by algorithm 4.3 (whether using the Kiefer-Wolfowitz or SPSA gradient estimator) converge to a local minimum of the loss function.

Theorem 4.6 For batch size L \to \infty in algorithm 4.2, the sequence of estimates \{\theta_n\} generated by the controlled Brownian dynamics simulation algorithm 4.3 converges as n \to \infty to the locally optimal PMF estimate \theta^* (defined in eq. 4.68) with probability 1.


Outline of Proof Since, by construction of the BD algorithm, for fixed \theta the Q_n(\theta, \lambda) are independent and identically distributed random variables, the proof of the above theorem involves showing strong convergence of a stochastic gradient algorithm with i.i.d. observations—which is quite straightforward. In Kushner and Yin (1997), almost sure convergence of stochastic gradient algorithms for state-dependent Markovian noise under general conditions is presented. For the independent and identically distributed case we only need to verify the following condition for convergence. Condition A.4.11 in section 8.4 of Kushner and Yin (1997) requires uniform integrability of \widehat{\nabla}_\theta Q_n(\theta_n, \lambda) in equation 4.70. This holds since the discretized version of the passage time satisfies \hat{\tau}^{(\theta,\lambda)}_{R_1,R_2}(L) \ge 1, implying that the estimate \hat{I}_n^{(\theta,\lambda)} from equation 4.60 in algorithm 4.2 is uniformly bounded. Thus the evaluated loss Q_n(\theta, \lambda) (eq. 4.66) is uniformly bounded. This in turn implies that the finite difference estimate (eq. 4.72) of the Kiefer-Wolfowitz algorithm, or equation 4.73 of the SPSA algorithm, is uniformly bounded, which implies uniform integrability. Then theorem 4.3 of Kushner and Yin (1997) implies that the sequence \{\theta_n\} generated by the above controlled BD simulation algorithm 4.3 converges with probability 1 to the fixed points of the following ordinary differential equation:

\frac{d\theta}{dt} = -E\Big[\sum_{\lambda \in \Lambda} \widehat{\nabla}_\theta Q_n(\theta, \lambda)\Big] = -\nabla_\theta Q(\theta),

namely the set Θ∗ of local minima that satisfy the second-order suﬃcient conditions in equation 4.69.
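To make the contrast with the finite difference estimator concrete, the SPSA recursion (eqs. 4.70, 4.71, 4.73) can be sketched as follows. As before, the BD-simulated loss is replaced by a noisy synthetic quadratic with a known minimizer (an illustrative assumption only); the Bernoulli directions d_n give two loss evaluations per iteration regardless of the dimension p:

```python
import numpy as np

rng = np.random.default_rng(1)

def spsa_gradient(Q, theta, mu):
    """SPSA gradient estimate (eq. 4.73): two loss evaluations
    per call, independent of the dimension p of theta."""
    d = rng.choice([-1.0, 1.0], size=len(theta))  # Bernoulli directions d_n
    dQ = Q(theta + mu * d) - Q(theta - mu * d)
    return dQ / (2.0 * mu * d)                    # i-th entry: dQ / (2 mu d_n(i))

# Stand-in for the BD-evaluated loss: a noisy quadratic with known
# minimizer (an illustrative assumption, not the chapter's BD loss).
theta_star = np.array([1.0, -2.0, 3.0, 0.5])

def Q(theta):
    return np.sum((theta - theta_star) ** 2) + 0.01 * rng.standard_normal()

theta = np.zeros(4)
for n in range(5000):
    eps_n = 0.5 / (n + 1 + 10) ** 0.602  # eq. 4.71, kappa = 0.602
    mu_n = 0.2 / (n + 1) ** 0.101        # gamma = 0.101
    theta = theta - eps_n * spsa_gradient(Q, theta, mu_n)

print(theta)  # settles near theta_star with only 2 evaluations per step
```

Because the elements of `d` are ±1, the elementwise division by `2.0 * mu * d` implements the per-coordinate denominators 2\mu_n d_n(i) of eq. 4.73 directly.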

4.7

Conclusions

This chapter has presented novel signal-processing and stochastic optimization (control) algorithms for two significant problems in ion channels—namely, the gating problem and the permeation problem. For the gating problem we presented novel discrete stochastic optimization algorithms and also a multiarmed bandit formulation for activating ion channels on a biological chip. For the permeation problem, we presented an adaptive controlled Brownian dynamics simulation algorithm for estimating the structure of the ion channel. We refer the reader to Krishnamurthy and Chung (a,b) for further details of the adaptive Brownian dynamics algorithm and convergence proofs.

The underlying theme of this chapter is the idea of sensor adaptive signal processing, which transcends standard statistical signal processing (which deals with extracting signals from noisy measurements) to address the deeper issue of how to dynamically minimize sensor costs while simultaneously extracting a signal from noisy measurements. As we have seen in this chapter, the resulting problem is a dynamic stochastic optimization problem—whereas all traditional statistical signal-processing problems (such as optimal and adaptive filtering, parameter estimation) are merely static stochastic optimization problems. Furthermore, we have demonstrated the use of sophisticated new algorithms such as stochastic discrete optimization, partially observed Markov decision processes (POMDPs), bandit processes, multiparticle Brownian dynamics simulations, and gradient estimation-based stochastic approximation algorithms. These novel methods provide a powerful set of algorithms that will supersede conventional signal-processing tools such as elementary stochastic approximation algorithms (e.g., the LMS algorithm), subspace methods, etc. In current work we are examining several extensions of the ideas in this chapter, including estimating the shape of ion channels. Finally, it is hoped that this chapter will motivate more researchers in the areas of statistical signal processing and stochastic control to apply their expertise to the exciting area of ion channels.

Acknowledgments

The author thanks his collaborator Dr. Shin-Ho Chung of the Australian National University, Canberra, for several useful discussions and for preparing fig. 4.4. This chapter is dedicated to the memory of the author's mother, Mrs. Bani Krishnamurthy, who passed away on November 26, 2004.

5

Spin Diﬀusion: A New Perspective in Magnetic Resonance Imaging

Timothy R. Field

5.1

Context In this chapter we outline some emerging ideas relating to the detection of spin populations arising in magnetic resonance imaging (MRI). MRI techniques are already at a mature stage of development, widely used as a research tool and practised in clinical medicine, and provide the primary noninvasive method for studying internal brain structure and activity, with excellent spatial resolution. A lot of attention is typically paid to detecting spatial anomalies in brain tissue, e.g., brain tumors, and in localizing certain areas of the brain that correspond to particular stimuli, e.g., within the motor, visual, and auditory cortices. More recently, functional magnetic resonance imaging (fMRI) techniques have been successfully applied in psychological studies analyzing the temporal response of the brain to simple known stimuli. The possibility of enhancing these techniques to deal with more sophisticated neurological processes, characterized by physiological changes occurring on shorter timescales, provides the motivation for developing real-time imaging techniques, where there is much insight to be gained, e.g., in tracking the auditory system. The concept of MRI is straightforward and can be brieﬂy summarized as follows (e.g., Brown and Semelka, 2003). The physical phenomenon of magnetic resonance is due to the Zeeman eﬀect, which accounts for the splitting of energy levels in atomic nuclei due to an applied magnetic ﬁeld. In the presence of a background magnetic ﬁeld B0 , the majority of protons (hydrogen nuclei) N+ in a material tend toward their minimum energy spin conﬁguration, with their spin vectors aligned with B0 , in the “spin-up” state (and thus in the minimum energy eigenstate of the quantum mechanical spin Hamiltonian). A smaller number N− take up the excited state with spin antiparallel to B0 , the “spin-down” state. 
According to statistical mechanics arguments, the ratio of the spin-up to spin-down populations N+ /N− is governed by the Bose-Einstein distribution. Thus, the majority of protons are able to absorb available energy and make a transition to an excited state, provided the energy is applied in a way that matches the resonance properties of


the protons in the material. The details of this energy absorption stem from the design of the MR experiment. A pulse of energy is applied to the material inside the background ﬁeld, which is absorbed (deterministically) and subsequently radiated away in a random-type fashion. Although the intrinsic nuclear atomic energies are very large in comparison, the diﬀerences between the spin-up/down energy levels, as predicted by the Zeeman interaction, lie in the radio frequency (RF) band of the electromagnetic spectrum. It is these energy diﬀerences (quantum gaps, if you like) that give rise to the received signal in MR experiments, through subsequent radiation of the absorbed energy. Although it is a quantum eﬀect, many ideas from classical physics can be drawn into play in terms of a qualitative understanding of the origin of the MR signal. Conservation of energy is a key component, as also are the precessional dynamics of the net magnetization and Faraday’s law of electromagnetic induction, described in section 5.2. It turns out that molecular environment aﬀects the precise values of splitting in proton spin energy levels, so that, e.g., a proton in a fat molecule has a diﬀerent absorption spectrum from a proton in water. The determination of the values of these “chemical shifts” is the basis of magnetic resonance spectroscopy (MRS). Once these “ﬁngerprint” resonance absorption values are known, one can design spectrally selective RF pulses in MR experiments so that only protons in certain chemical environments, with magnetic resonance absorption properties matching the frequency (photon energy) spectrum of the pulse, are excited into states of higher energy. The point of view taken here is that the spin population itself is the object of primary signiﬁcance in constructing an image, especially when it comes to the study of neural brain activity. 
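To put rough numbers on the Zeeman splitting discussed above: the proton gyromagnetic ratio below is a standard physical constant, while the 1.5 T field strength is a hypothetical example value chosen for illustration.

```python
# Standard physical constants for the proton; the field strength B0 is a
# hypothetical example value.
h = 6.626e-34              # Planck constant, J s
gamma_over_2pi = 42.577e6  # proton gyromagnetic ratio gamma/2pi, Hz/T

B0 = 1.5                         # example background field, tesla
f0 = gamma_over_2pi * B0         # Larmor frequency omega0/2pi, Hz
dE = h * f0                      # Zeeman energy gap, joules

print(f"f0 = {f0 / 1e6:.2f} MHz")  # ~64 MHz: squarely in the RF band
print(f"dE = {dE:.2e} J")          # ~4e-26 J: tiny vs. intrinsic nuclear energies
```

The resulting frequency lies in the radio band, consistent with the text's observation that the spin-up/down energy differences are small compared with intrinsic nuclear energies.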
The reasons for this emphasis on the spin population size, as opposed to certain notions of its time-average behavior, are twofold. First, it is the spin density of protons, in a given molecular environment, that is the fundamental object of the detection. Second, the spin population is something that can in principle be studied in real time. In contrast, the more familiar techniques known as T1- and T2-weighted imaging involve a statistical time average, measuring the respective "spin-lattice" and "spin-spin" relaxation times: T1 is the average restoration time for the longitudinal component and T2 the average decay time for the transverse component of the local magnetic field in the medium. Thereby, information concerning the short-timescale properties of the population is necessarily lost. It is argued here that in principle one can infer the (local) size of a resonant spin population, from a large population whose dynamics is arbitrary, through a novel type of signal-processing technique which exploits certain ingredients in the physics of the spin dephasing process occurring in T2 relaxation.

The emphasis in this chapter is on the ideas and novel concepts, their significance in drawing together ideas from physics and signal processing, and the implications and new perspectives they provide in the context of magnetic resonance imaging. The detailed underlying mathematics, although an essential part of the sustained theoretical development, is highly technical and so is reported separately elsewhere (Field, 2005). In section 5.2 we give the background to the description of an electromagnetic field interacting with a random population, in terms of the mathematics of stochastic differential equations (SDEs). Section 5.3 illustrates how this dynamics can be applied to an arbitrary spin population, in the context of constructing an image in magnetic resonance experiments. The possible implications of these ideas for future MRI systems are described in section 5.4, where we identify certain domains of validity of the proposed model and the appropriate corresponding choice of some design parameters that would be necessary for successful implementation. We provide, without proof, two key mathematical results behind this line of development, concerning the dynamics of spin-spin relaxation in proposition 5.1 and the observability of the spin population through statistical analysis of the phase fluctuations in theorem 5.1. The reader is referred to Field (2005) for their detailed mathematical derivation.

5.2

Conceptual Framework

Our purpose is to identify the dynamics of spin-spin (T2) relaxation using a geometrical description of the transverse spin population, and the mathematics of SDEs (e.g., Oksendal, 1998) to derive the continuous-time statistical properties of the net transverse magnetization. In doing so we are led to an exact expression for the "hidden" state (the spin population level) in terms of the additional phase degrees of freedom in the MR signal, described in section 5.3. In our discussion we shall not confine ourselves to any specific choice of population model, and thus encompass the possibility of describing the highly nonstationary behavior characteristic of brain signals that encode information in real time in response to stimuli, such as occur in the auditory system.

Let us assume that the RF pulse is applied at a "pulse flip" angle of 90° to the longitudinal B_0 direction. As a result, RF energy \Delta E is absorbed, and the net local magnetization is rotated into the transverse plane. Each component spin vector then rotates about the longitudinal axis at (approximately) the Larmor precessional frequency \omega_0, governed by the following relation:

\Delta E = \hbar \omega_0 = h \gamma B_0 / 2\pi,  (5.1)

where \gamma is the gyromagnetic ratio (which varies throughout space depending on the details of the molecular environment). The resulting motion of the net local magnetization vector M_t can be understood by analogy with the Eulerian top in classical mechanics. As time progresses following the pulse, energy is transferred from the proton spins to the surroundings during the process of "spin-lattice" relaxation, and the longitudinal component of the net magnetization is gradually restored to the equilibrium value prior to the pulse being applied. Likewise, random exchange of energy between neighboring spins and small inhomogeneities in the total magnetic field cause perturbations in the phases of the transverse spin components and "dephasing" occurs, so that the net transverse component of magnetization decays to zero. The motion of M_t can thus be visualized as a


Figure 5.1 Geometry of transverse spin population—each point represents a constituent proton, with respect to which each connecting vector (in the direction of the random walk away from the origin) represents the transverse component of the associated spin vector.

precession about the longitudinal axis, over the surface of a cone whose opening angle tends from \pi to 0 as equilibrium is restored. It is convenient for visualization purposes to work in a rotating frame of reference, rotating at the Larmor frequency \omega_0 about the longitudinal axis. (It is worth remarking that in the corresponding situation for radar scattering, \omega_0 is the Doppler frequency arising from bulk wave motion in the scattering surface.) This brings each transverse spin vector to rest, for a perfectly homogeneous B_0. Nevertheless it is the local inhomogeneities in the total (background plus internal) magnetic field, due to the local magnetic properties of the medium, that give rise to spin-spin (or T2) relaxation, constituting the (random) exchange of energy between spins. The local perturbations in the total magnetic field can reasonably be considered as independent for each component spin, so that the resultant (transverse) spin vector can be modeled as a sum of independent random phasors. Thus, the pertinent expression for the resultant net transverse magnetization is

M_t^{(N_t)} = \sum_{j=1}^{N_t} s^{(j)} = \sum_{j=1}^{N_t} a_j \exp\big(i \varphi_t^{(j)}\big),  (5.2)


with (fluctuating) spin population size N_t, random phasor steps s^{(j)}, and component amplitudes a_j. Observe that this random walk type model is directly analogous to that used in the dynamical description of Rayleigh scattering (Field and Tough, 2003b), introduced in Jakeman (1980). Thus, the geometrical spin structure, transverse to the background longitudinal magnetic field, lies in correspondence with the plane-polarized (perpendicular to the propagation vector) components of the electromagnetic field arising in (radar) scattering and (optical) propagation (Field and Tough, 2003a), where the same types of ideas, albeit for specific types of (stationary) populations, have been experimentally verified (Field and Tough, 2003b, 2005). The geometry of fig. 5.1 illustrates the isomorphism between the transverse spin structure of atomic nuclei and photon spin for a plane-polarized state, the latter familiar from radio theory as the (complex-valued) electromagnetic amplitude perpendicular to the direction of propagation. Indeed, this duality between photon spin (EM wave polarization) and nuclear spin is a key conceptual ingredient in this development.

The dynamical structure of equation 5.2 is supplied by a (phase) diffusion model (Field and Tough, 2003b) which takes the component phases \{\varphi_t^{(j)}\} to be a collection of (displaced) Wiener processes evolving on a suitable timescale. Thus \varphi_t^{(j)} = \Delta^{(j)} + B^{1/2} W_t^{(j)}, with initialization \Delta^{(j)}. In magnetic resonance the 90° pulse explained above causes the spin phasors s^{(j)} to be aligned initially; thus the \Delta^{(j)} are identical for all j. Let T be the phase decoherence or spin-spin relaxation time (T2), such that \{\varphi_t^{(j)} \mid t \ge T\} have negligible correlation. Then for t \ge T, defining the (normalized) resultant by m_t = \lim_{N \to \infty} M_t^{(N)} / N^{1/2}, we obtain the resultant spin dynamics or net magnetization in the transverse plane (cf. Field, 2005).
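A minimal Monte Carlo sketch of this phase diffusion picture (all parameter values illustrative): the phasors start aligned after the 90° pulse, their phases diffuse independently at rate B, and the normalized resultant decays on the T2 timescale, roughly as e^{-Bt/2}.

```python
import numpy as np

rng = np.random.default_rng(2)
N, B, dt, steps = 5000, 2.0, 0.01, 300  # illustrative values

a = np.ones(N)     # unit component amplitudes a_j
phi = np.zeros(N)  # phases aligned by the 90-degree pulse (Delta = 0)

mag = []
for _ in range(steps):
    # independent Wiener increments for each component phase
    phi += np.sqrt(B * dt) * rng.standard_normal(N)
    mag.append(abs(np.sum(a * np.exp(1j * phi))) / N)

# The normalized resultant decays roughly as exp(-B t / 2): dephasing of
# the net transverse magnetization, i.e., spin-spin (T2) relaxation.
print(mag[0], mag[-1])
```

The residual magnitude at large times is set by the finite population, of order N^{-1/2}, which is what motivates the normalization m_t = M_t^{(N)}/N^{1/2} above.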
Proposition 5.1 For sufficiently large times t \ge T the spin dynamics, for a constant population, is given by the complex Ornstein-Uhlenbeck equation

dm_t = -\tfrac{1}{2} B m_t \, dt + B^{1/2} \langle a^2 \rangle^{1/2} \, d\xi_t,  (5.3)

where \xi_t is a complex-valued Wiener process (satisfying |d\xi_t|^2 = dt, d\xi_t^2 = 0).

As the collection of spins radiates the absorbed energy, this gives rise to the received MR signal \psi_t (the free induction decay or FID), which is detected through the generation of electromotive force in a coil apparatus, due to the time-varying local magnetic field. This effect is the result of Faraday's law, i.e., Maxwell's vector equation \nabla \times E = -\partial B/\partial t integrated around a current loop. The receiver coil is placed perpendicular to the transverse plane, and so only the transverse components of the change in magnetic flux contribute to the FID. The MR signal thus corresponds to an amplitude process that represents the (time derivative of the) net transverse magnetization, and has the usual in-phase (I) and quadrature-phase (Q) components familiar from radio theory, so \psi = I + iQ. Moreover it can be spatially localized using standard gradient field techniques (e.g., Brown and Semelka, 2003). The constant B in equation 5.3, which has dimensions of frequency, is (proportional to) the reciprocal of the spin-spin relaxation time T2.
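A direct Euler-Maruyama discretization of the complex Ornstein-Uhlenbeck equation 5.3 (with \langle a^2 \rangle = 1, and illustrative choices of B and step size) reproduces the stationary speckle statistics, E|m_t|^2 = 1:

```python
import numpy as np

rng = np.random.default_rng(3)
B, dt, steps = 2.0, 1e-3, 100_000  # illustrative values; B ~ 1/T2
m = 1.0 + 0.0j                     # fully aligned just after the pulse

intensity = np.empty(steps)
for k in range(steps):
    # complex Wiener increment with |dxi|^2 = dt and dxi^2 = 0
    dxi = np.sqrt(dt / 2) * (rng.standard_normal() + 1j * rng.standard_normal())
    m = m - 0.5 * B * m * dt + np.sqrt(B) * dxi
    intensity[k] = abs(m) ** 2

# With <a^2> = 1 the stationary intensity satisfies E|m_t|^2 = 1.
stationary_intensity = np.mean(intensity[steps // 2:])
print(stationary_intensity)  # close to 1
```

The coherent initial condition decays at rate B/2 while the noise term feeds in fluctuations, so after a few multiples of T2 the process is pure zero-mean speckle of unit mean intensity.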


In the case that the population size fluctuates over the timescale of interest, we introduce the continuum population variate x_t, and the receiver amplitude then has the compound representation

\psi_t = x_t^{1/2} m_t,  (5.4)

in which x_t and m_t are independent processes.

5.3

Image Extraction

An essential desired ingredient is the ability to handle real-time nonstationary behavior in the spin population. Moreover we do not wish to make any prior assumptions concerning the statistical behavior of this population, for it is this unknown state that we are trying to estimate from the observed MR signal. Indeed, its value depends on the external stimuli in a way that is not well understood, and it is precisely our purpose to uncover the nature of this dependence by processing the observable data in an intelligent way. But in doing so, we do not wish to prejudice our notions of spin population behavior, beyond some very generic assumptions set in place for the purpose of mathematical tractability. Accordingly we shall assume that the population process x_t is an Ito process, i.e., that it satisfies the SDE (e.g., Oksendal, 1998)

dx_t = A b_t \, dt + (2 A \Sigma_t)^{1/2} \, dW_t^{(x)},  (5.5)

in which the respective drift and diffusion parameters b_t, \Sigma_t are (real-valued) stochastic processes (not necessarily Ito), and in general include the effects of nonstationary and non-Gaussian behavior. Thus, we make no prior assumption concerning the nature of these parameters, and wish to estimate the values of \{x_t\} from our observations of the MR signal. The SDE for the resultant phase can be derived from equations 5.3, 5.4, and 5.5. Intriguingly, the behavior for a general population is functionally independent of the parameters b_t and \Sigma_t, the effect of these parameters coming through in the resulting evolutionary structure of the processes x_t, \psi_t. Calculation of the resulting squared phase fluctuations leads to the key noteworthy result of this chapter.

Theorem 5.1 The spin population is observable through the intensity-weighted squared phase fluctuations of the (FID) signal according to

x_t = \frac{2}{B} z_t \, \frac{d\theta_t^2}{dt}  (5.6)

throughout space and time, where z_t = |\psi_t|^2 and the field m_t is scaled so that \langle a^2 \rangle = 1.


The significance of this result is that the relation 5.6 is exact, instantaneous in nature, and moreover independent of the dynamics of the spin population. It is straightforward to illustrate this result with data sampled in discrete time (Field, 2005). The result approaches exactness as the pulse repetition rate tends to infinity. More precisely, for discretely sampled data theorem 5.1 implies that

z_i \, \delta\theta_i^2 \propto x_i n_i^2,  (5.7)

where i is a discrete time index and \{n_i\} are an independent collection of N(0,1)-distributed random variables. This can be used to estimate the state x_t via local time averaging. Applying a smoothing operation \langle \cdot \rangle_\Delta to the left-hand side (the "observations") of 5.7, with window [t_0 - \Delta, t_0 + \Delta], yields an approximation to x_{t_0}, with an error that tends to zero as the number of pulses inside the window tends to infinity and \Delta \to 0 (Field, 2005). In this respect we observe as a consequence that in order to achieve an improved signal-to-noise ratio (SNR), it is sufficient merely to increase the pulse rate, without (necessarily) requiring a high-amplitude signal. The term SNR is used here in the sense of extracting the signal x_t from \psi_t, thus overcoming the noise in m_t (cf. eq. 5.4). This inference capability has been demonstrated in analysis of synthetic data, with population statistics chosen deliberately not to conform to the types of model usually encountered (Field, 2005). Instead of "filtering out" the noise to obtain the signal, we have exploited useful information contained in the random fluctuations in the phase. Indeed the phase noise is so strongly colored that, provided it is sampled at high enough frequency, it enables us to extract the precise values of the underlying state, in this case the population size of spins that have absorbed the RF energy. A comparison of time series of the state inferred from the observations alone, with the exact values recorded in generation of the synthetic data, shows a very high degree of correlation (Field, 2005).
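The inference recipe of eqs. 5.6 and 5.7 can be sketched end to end on synthetic data: a hypothetical slowly varying population x_t (here a sinusoid, chosen arbitrarily, since the theorem assumes nothing about its dynamics), the Ornstein-Uhlenbeck speckle m_t of eq. 5.3, the compound signal \psi_t of eq. 5.4, and recovery of x_t by local time averaging of the intensity-weighted squared phase increments. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
B, dt, n = 2.0, 1e-3, 200_000
t = np.arange(n) * dt

# Hypothetical slowly varying spin population (any positive process
# would do; theorem 5.1 assumes nothing about its dynamics):
x = 2.0 + np.sin(2 * np.pi * 0.01 * t)

# Fast complex OU speckle m_t, eq. 5.3 with <a^2> = 1:
m = np.empty(n, dtype=complex)
m[0] = 1.0
dxi = np.sqrt(dt / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
for k in range(1, n):
    m[k] = m[k - 1] - 0.5 * B * m[k - 1] * dt + np.sqrt(B) * dxi[k]

psi = np.sqrt(x) * m                   # compound representation, eq. 5.4
z = np.abs(psi) ** 2                   # intensity z_t = |psi_t|^2
dtheta = np.angle(psi[1:] / psi[:-1])  # wrap-safe phase increments

# Eq. 5.7: z_i * dtheta_i^2 is proportional to x_i.  Recover x_t by a
# moving-average smoothing, rescaled by 2/(B dt) as in eq. 5.6.
obs = z[:-1] * dtheta ** 2
w = 4000  # smoothing window, in samples
xhat = (2.0 / (B * dt)) * np.convolve(obs, np.ones(w) / w, mode="same")

# Away from the window edges the estimate tracks the hidden population.
rel_err = np.abs(xhat[w:-w] - x[w:-w - 1]) / x[w:-w - 1]
print(np.median(rel_err))
```

Note that no property of x beyond positivity is used in the estimator itself, which is the point of the theorem: the drift and diffusion parameters of eq. 5.5 never appear.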

5.4

Future Systems

For MR imaging purposes, we have proposed focusing on the local size of the transverse spin population that results from the applied RF pulse, which is assumed to be spectrally selective. Our results demonstrate, at a theoretical/simulation level, how this "signal" can be extracted through statistical analysis of the (phase) fluctuations in the received (complex) amplitude signal. In the MR context, the spin population, which assumes a Bose-Einstein distribution in equilibrium, responds in a dynamical nonstationary fashion to applied RF radiation, and our results suggest means for detecting this dynamical behavior using a combination of physical modeling and novel statistical signal-processing techniques based on the mathematics of SDEs. The idea of focusing attention on the spin population size appears to be more intuitive than measurements of the single-point statistic T2, the average transverse


decay time, that is commonly used in T2-weighted imaging (Brown and Semelka, 2003). Primarily one is concerned here with estimating the population of spins that absorb energy from the RF pulse at different locations, since this implies the spin density of (protons in) the molecular environments of interest whose energy resonances (predicted through the Zeeman interaction) match those of the designed pulse. Our discussion demonstrates how the error in the estimate of a random population interacting with the electromagnetic field can be reduced to an arbitrarily small amount. In the MR context this suggests that a moderate/low background field strength B_0 and a short pulse repetition time TR could be sufficient for generating real-time images. A complication posed by the high specific absorption rate SAR \propto B_0^2 / TR (which measures the deposition of energy in the medium in the form of heat) for short TR can presumably be overcome by using short-duration bursts of RF energy (just as in radar systems) to detect short-timescale properties. In summary, the results suggest means for real-time image construction in fMRI experiments at moderate to low magnetic field strength, and identify the choice of parameters necessary in the design of fMRI systems for the technique to be valid. There are indications that, for appropriate parameter values, the technique should succeed in extracting the real-time spin population behavior in the context of MR, without prior assumptions concerning the nature of this population needing to be made. Indeed, at the level of simulated data, this result has been verified (see fig. 2 in Field, 2005). The ability to track the spin population from the observed amplitude in local time suggests the possibility of detecting spatiotemporal changes in neural activity in the brain, e.g., in the localization and tracking of evoked responses.
Acknowledgments

The author is grateful to John Bienenstock and Michael Noseworthy of the Brain Body Institute, Simon Haykin, and José Príncipe.

6

What Makes a Dynamical System Computationally Powerful?

Robert Legenstein and Wolfgang Maass

We review methods for estimating the computational capability of a complex dynamical system. The main examples that we discuss are models for cortical neural microcircuits with varying degrees of biological accuracy, in the context of online computations on complex input streams. We address in particular the question to what extent earlier results about the relationship between the edge of chaos and the computational power of dynamical systems in discrete time for oﬀ-line computing also apply to this case.

6.1

Introduction Most work in the theory of computations in circuits focuses on computations in feedforward circuits, probably because computations in feedforward circuits are much easier to analyze. But biological neural circuits are obviously recurrent; in fact the existence of feedback connections on several spatial scales is a characteristic property of the brain. Therefore an alternative computational theory had to be developed for this case. One neuroscientist who emphasized the need to analyze information processing in the brain in the context of dynamical systems theory was Walter Freeman, who started to write a number of inﬂuential papers on this topic in the 1960s; see Freeman (2000, 1975) for references and more recent accounts. The theoretical investigation of computational properties of recurrent neural circuits started shortly afterward. Earlier work focused on the engraving of attractors into such systems in order to restrict the dynamics to achieve well-deﬁned properties. One stream of work in this direction (see, e.g., Amari, 1972; Cowan, 1968; Grossberg, 1967; Little, 1974) culminated in the inﬂuential studies of Hopﬁeld regarding networks with stable memory, called Hopﬁeld networks (Hopﬁeld, 1982, 1984), and the work of Hopﬁeld and Tank on networks which are able to ﬁnd approximate solutions of hard combinatorial problems like the traveling salesman


problem (Hopfield and Tank, 1985, 1986). The Hopfield network is a fully connected neural network of threshold or threshold-like elements. Such networks exhibit rich dynamics and are chaotic in general. However, Hopfield assumed symmetric weights, which strongly constrains the dynamics of the system. Specifically, one can show that under this assumption only point attractors can emerge in the dynamics, i.e., the activity of the elements always evolves to one of a set of stable states, in which it then remains forever. Somewhat later the alternative idea arose to use the rich dynamics of neural systems that can be observed in cortical circuits, rather than to restrict them (Buonomano and Merzenich, 1995). In addition, it was realized that one needs to look at online computations (rather than off-line or batch computing) in dynamical systems in order to capture the biologically relevant case (see Maass and Markram, 2005, for definitions of such basic concepts of computation theory). These efforts resulted in the “liquid state machine” model by Maass et al. (2002) and the “echo state network” by Jaeger (2002), which were introduced independently. The basic idea of these models is to use a recurrent network to hold and nonlinearly transform information about the past input stream in the high-dimensional transient state of the network. This information can then be used to produce various desired online outputs in real time with simple linear readout elements. These readouts can be trained to recognize common information in dynamically changing network states because of the high dimensionality of these states. It has been shown that these models exhibit high computational power (Jaeger and Haas, 2004; Joshi and Maass, 2005; Legenstein et al., 2003). However, the analytical study of such networks with rich dynamics is hard.
Fortunately, there exists a vast body of literature on related questions in many different scientific disciplines within the more general framework of dynamical systems theory. Specifically, one stream of research is concerned with system dynamics located at the boundary region between ordered and chaotic behavior, which has been termed the “edge of chaos.” This research is of special interest for the study of neural systems because it was shown that the behavior of dynamical systems is most interesting in this region. Furthermore, a link between the computational power of dynamical systems and the edge of chaos was conjectured. It is therefore a promising goal to use concepts and methods from dynamical systems theory to analyze neural circuits with rich dynamics, and in this way to obtain better tools for understanding computation in the brain. In this chapter, we will take a tour through research concerned with the edge of chaos and eventually arrive at a first step toward this goal. The aim of this chapter is to guide the reader through a stream of ideas which we believe are inspiring for research in neuroscience and molecular biology, as well as for the design of novel computational devices in engineering. After a brief introduction to fundamental principles of dynamical systems theory and chaos in section 6.2, we will start our journey in section 6.3 in the field of theoretical biology. There, Kauffman studied questions of evolution and emerging order in organisms. We will see that, depending on the connectivity structure, networks may operate either in an ordered or in a chaotic regime. Furthermore, we will encounter the edge of chaos as a transition between these dynamic regimes. In section 6.4, our tour will visit the field of statistical physics, where Derrida and others studied related questions and provided new methods for their mathematical analysis. In section 6.5 the reader will see how these ideas can be applied in the theory of computation. The study of cellular automata by Wolfram, Langton, Packard, and others led to the conjecture that complex computations are best performed in systems at the edge of chaos. The next stops of our journey, in sections 6.6 and 6.7, will bring us close to our goal. We will review work by Bertschinger and Natschläger, who analyzed real-time computations at the edge of chaos in threshold circuits. In section 6.8, we will briefly examine self-organized criticality, i.e., how a system can adapt its own dynamics toward the edge of chaos. Finally, section 6.9 presents the efforts of the authors of this chapter to apply these ideas to computational questions in the context of biologically realistic neural microcircuit models. In that section we will analyze the edge of chaos in networks of spiking neurons and ask the following question: In what dynamical regimes are neural microcircuits computationally most powerful? Table 6.1 shows that neural microcircuits (as well as gene regulation networks) differ in several essential aspects from those examples of dynamical systems that are commonly studied in dynamical systems theory.

Table 6.1  General Properties of Various Types of Dynamical Systems

                        Cellular    Iterative   Boolean       Cortical Microcircuits and
                        Automata    Maps        Circuits      Gene Regulation Networks
  Analog states?        no          yes         no            yes
  Continuous time?      no          no          no            yes
  High dimensional?     yes         no          yes           yes
  With noise?           no          no          no            yes
  With online input?    no          no          usually no    yes

6.2 Chaos in Dynamical Systems

In this section we briefly introduce ideas from dynamical systems theory and chaos. Slightly different definitions of chaos are given in the literature. Although we will mostly deal here with systems in discrete time and with a discrete state space, we start out with the well-established definition of chaos in continuous systems and return to discrete systems later in this section. The subject known as dynamics deals with systems that evolve in time (Strogatz, 1994). The system in question may settle down to an equilibrium, may enter a periodic trajectory (a limit cycle), or may do something more complicated. In Kaplan


and Glass (1995), the dynamics of a deterministic system is defined as chaotic if it is aperiodic and bounded, with sensitive dependence on initial conditions. The phase space of an N-dimensional system is the space with coordinates x1, . . . , xN. The state of an N-dimensional dynamical system at time t is represented by the state vector x(t) = (x1(t), . . . , xN(t)). If a system starts in some initial condition x(0), it will evolve according to its dynamics and describe a trajectory in state space. A steady state of the system is a state xs such that if the system evolves with xs as its initial state, it will remain in this state for all future times. Steady states may or may not be stable to small outside perturbations. For a stable steady state, small perturbations die out and the trajectory converges back to the steady state. For an unstable steady state, trajectories do not converge back to the steady state after arbitrarily small perturbations. In general, an attractor is defined as a set of points or states in state space to which trajectories within some volume of state space converge asymptotically over time. This set itself is invariant under the dynamic evolution of the system.1 Thus, a stable steady state is a zero-dimensional, or point, attractor. The set of initial conditions that evolve to an attractor A is called the basin of attraction of A. A limit cycle is an isolated closed trajectory; isolated means that neighboring trajectories are not closed. If released at some point of the limit cycle, the system flows around the cycle repeatedly. The limit cycle is stable if all neighboring trajectories approach it. A stable limit cycle is a simple type of attractor. Higher-dimensional and more complex types of attractors exist as well. In addition, there exist so-called strange, or chaotic, attractors. For example, all trajectories in a high-dimensional state space might be brought onto the two-dimensional surface of some manifold.
The interesting property of such attractors is that, if the system is released in two different experiments from two points on the attractor which are arbitrarily close to each other, the subsequent trajectories remain on the attractor surface but diverge from each other. After sufficient time, the two trajectories can be arbitrarily far apart. This extreme sensitivity to initial conditions is characteristic of chaotic behavior. In fact, exponentially fast divergence of trajectories (characterized by a positive Lyapunov exponent) is often used as a definition of chaotic dynamics (see, e.g., Kantz and Schreiber, 1997). There can still be a lot of structure in chaotic dynamics, since the trajectory of a high-dimensional system might be confined to a mere two-dimensional surface. However, since the trajectory on the attractor is chaotic, the exact trajectory is practically unpredictable (even if the system is deterministic). Systems in discrete time and with a finite discrete state space differ from continuous systems in several respects. First, since the state variables are discrete, trajectories can merge, whereas in a continuous system they can only approach each other. Second, since there is a finite number of states, the system must eventually reenter a state previously encountered and will thereafter cycle repeatedly through this state cycle. These state cycles are the dynamical attractors of the discrete system. The set of states flowing into such a state cycle or lying on it constitutes the basin of attraction of that state cycle. The length of a state cycle is the number


of states on the cycle. For example, the memory states in a Hopfield network (a network of artificial neurons with symmetric weights) are the stable states of the system. A Hopfield network does not have state cycles of length larger than one. The basins of attraction of the memory states are used to drive the system from related initial conditions to the same memory state, hence constituting an associative memory device. Characteristic properties of chaotic behavior in discrete systems are long state cycles and high sensitivity to initial conditions. Ordered networks have short state cycles, and their sensitivity to initial conditions is low, i.e., their state cycles are quite stable. We note that state cycles can be stable with respect to some small perturbations but unstable with respect to others. Therefore, “quite stable” means in this context that the state cycle is stable to a high percentage of small perturbations. These general definitions are not very precise and will be made more specific for each of the subsequent concrete examples.
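The notions of transient and state cycle in a finite discrete system can be made concrete in a few lines of code. The following sketch (the helper name and the toy 16-state lookup-table dynamics are our own illustration, not from the chapter) iterates a deterministic map until a state reappears and reports the transient length and the state-cycle length:

```python
import random

def transient_and_cycle(step, state):
    """Iterate a deterministic map on a finite state space until some state
    reappears; return (transient length, state-cycle length)."""
    first_visit = {}  # state -> time of first visit
    t = 0
    while state not in first_visit:
        first_visit[state] = t
        state = step(state)
        t += 1
    return first_visit[state], t - first_visit[state]

# Toy dynamics on 16 states, given by a fixed random lookup table.
rng = random.Random(1)
table = [rng.randrange(16) for _ in range(16)]
transient, cycle = transient_and_cycle(lambda s: table[s], 0)
print(transient, cycle)
```

Since there are only 16 states, the loop is guaranteed to terminate within 17 steps; the same routine applies unchanged to the Boolean networks discussed below, with tuples of ±1 values as states.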

6.3 Randomly Connected Boolean Networks

The study of complex systems is obviously important in many scientific areas. In genetic regulatory networks, thousands or millions of coupled variables orchestrate the developmental programs of an organism. In 1969, Kauffman started to study such systems in the simplified model of Boolean networks (Kauffman, 1969, 1993). He discovered some surprising results, which will be discussed in this section. We will encounter systems in the ordered and in the chaotic phase. The specific phase depends on a simple structural feature of the system, and a phase transition occurs when this feature changes. A Boolean network consists of N elements and connections between them. The states of its elements are described by binary variables x1, . . . , xN. The dynamical behavior of each variable, i.e., whether it will be active (1) or inactive (−1) at the next time step, is governed by a Boolean function.2 The (directed) connections between the elements describe possible interactions. If there is a connection from element i to element j, then the state of element i influences the state of element j at the next time step. We say that i is an input of element j. An initial condition is given by a value xi(0) for each variable. Thereafter, the state of each element evolves according to the Boolean function assigned to it. We can describe the dynamics of the system by a set of iterated maps

    x1(t + 1) = f1(x1(t), . . . , xN(t))
    ...
    xN(t + 1) = fN(x1(t), . . . , xN(t)),

where f1, . . . , fN are Boolean functions.3 Here, all state variables are updated in parallel at each time step. The stability of attractors in Boolean networks can be studied with respect to


minimal perturbations. A minimal perturbation is the flip of the activity of a single variable to the opposite state. Kauffman studied the dynamics of Boolean networks as a function of the number N of elements in the network and the average number K of inputs to each element in the net. Since he was not interested in the behavior of particular nets but rather in the expected dynamics of nets with given N and K, he sampled at random from the ensemble of all such networks. Thus the K inputs to each element were first chosen at random and then fixed, and the Boolean function assigned to each element was also chosen at random and then fixed. For each such member of the ensemble, Kauffman performed computer simulations and examined the accumulated statistics. The case K = N is especially easy to analyze. Since the Boolean function of each element is chosen randomly from a uniform distribution, the successor of each circuit state is drawn randomly from a uniform distribution over the 2^N possible states. This leads to long state cycles: the median state cycle length is 0.5 · 2^(N/2). Kauffman called such exponentially long state cycles chaotic.4 These state cycles are unstable to most perturbations, hence there is a strong dependence on initial conditions. However, only a few different state cycles exist in this case: the expected number of state cycles is N/e. Therefore, there is some characteristic structure in the chaotic behavior, in the sense that the system will end up in one of only a few long-term behaviors. As long as K is not too small, say K ≥ 5, the main features of the case K = N persist. The dynamics is still governed by relatively few state cycles of exponential length, whose expected number is at most linear in N. For K ≥ 5, these results can be derived analytically by a rough mean field approximation. For smaller K, the approximation becomes inaccurate.
However, simulations confirm that exponential state cycle lengths and a linear number of state cycles are characteristic of random Boolean networks with K ≥ 3. Furthermore, these systems show high sensitivity to initial conditions (Kauffman, 1993). The case K = 2 is of special interest: there, a phase transition from ordered to chaotic dynamics occurs. Numerical simulations have revealed the following characteristic features of random Boolean networks with K = 2 (Kauffman, 1969). The expected median state cycle length is about √N. Thus, random Boolean networks with K = 2 often confine their dynamical behavior to tiny subvolumes of their state space, a strong sign of order. A more detailed analysis shows that most state cycles are short, whereas there are a few long ones. The number of state cycles is about √N, and the state cycles are inherently stable to about 80% to 90% of all minimal transient perturbations. Hence, the state cycles of the system have large basins of attraction and the sensitivity to initial conditions is low. In addition to these characteristics, which stand in stark contrast to networks with larger K, we want to emphasize three further features. First, typically at least 70% of the N elements have some fixed active or inactive state which is identical for all the existing state cycles of the Boolean network. These elements establish a frozen core. The frozen core creates walls of constancy which break the system into


functionally isolated islands of unfrozen elements. These islands are thus prevented from influencing one another. The boundary regime where the frozen core is just breaking up and interaction between the unfrozen islands becomes possible is the phase transition between order and chaos. Second, transiently altering the activity of a single element typically causes alterations in the activity of only a small fraction of the elements in the system. And third, deleting any single element or altering its Boolean function typically causes only modest changes in state cycles and transients. The latter two points ensure that “damage” to the system remains small. We will further discuss this interesting case in the next section. Networks with K = 1 operate in an ordered regime and are of little interest for us here.
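Kauffman's ensemble experiments are easy to reproduce at small scale. Below is a sketch (all parameter choices, N = 12, sample size, and seed, are ours, for illustration only) that samples quenched random Boolean networks and compares median state-cycle lengths for K = 2 and K = 5:

```python
import random

def sample_rbn(N, K, rng):
    """Quenched random Boolean network: fixed random inputs and a fixed
    random truth table for each of the N elements (states are +1/-1)."""
    inputs = [rng.sample(range(N), K) for _ in range(N)]
    tables = [[rng.choice((-1, 1)) for _ in range(2 ** K)] for _ in range(N)]
    def step(state):
        nxt = []
        for i in range(N):
            idx = 0
            for j in inputs[i]:
                idx = 2 * idx + (1 if state[j] == 1 else 0)
            nxt.append(tables[i][idx])
        return tuple(nxt)
    return step

def cycle_length(step, state):
    """Length of the state cycle reached from the given initial state."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state)
        t += 1
    return t - seen[state]

rng = random.Random(0)
N = 12
medians = {}
for K in (2, 5):
    lengths = sorted(
        cycle_length(sample_rbn(N, K, rng),
                     tuple(rng.choice((-1, 1)) for _ in range(N)))
        for _ in range(25))
    medians[K] = lengths[len(lengths) // 2]
print(medians)
```

For such small networks one typically observes much shorter median cycles for K = 2 than for K = 5, in line with the √N versus 0.5 · 2^(N/2) scaling discussed above.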

6.4 The Annealed Approximation by Derrida and Pomeau

The phase transition from order to chaos is of special interest. As we shall see in the sections below, there are reasons to believe that this dynamical regime is particularly well suited for computations. There have been several attempts to understand the emerging order in random Boolean networks. In this section, we review the approach of Derrida and Pomeau (1986). Their elegant analysis gives an analytical answer to the question of where such a transition occurs. In the original model, the connectivity structure and the Boolean functions fi of the elements i were chosen randomly but were then fixed, and the network evolved according to this fixed dynamics. In this case the randomness is quenched, because the functions fi and the connectivity do not change with time. Derrida and Pomeau presented a simple annealed approximation to this model which explains why there is a critical value Kc of K at which the transition from order to chaos appears. This approximation also allows the calculation of many properties of the model. In contrast to the quenched model, the annealed approximation randomly reassigns the connectivity and the Boolean functions of the elements at each time step. Although this assumption is quite drastic, it turns out that the agreement of the annealed approximation with simulations of the quenched model is surprisingly good. The benefits of the annealed model will become clear below. It was already pointed out that exponential state cycle length is an indicator of chaos. In the annealed approximation, however, there are no fixed state cycles, because the network is changed at every time step. Therefore, the calculations are based on the dependence on initial conditions. Consider two network states C1, C2 ∈ {−1, 1}^N. We define the Hamming distance d(C1, C2) as the number of positions in which the two states differ.
The question is whether two randomly chosen different initial network states eventually converge to the same pattern of activity over time. Stated in other words: given an initial state C1 which leads to a state C1(t) at time t, and a different initial state C2 which leads to a state C2(t) at time t, will the Hamming distance d(C1(t), C2(t)) converge to zero for

Figure 6.1  Expected distance yt+1 between two states at time t + 1 as a function of the state distance yt between two states at time t, based on the annealed approximation, plotted for K = 2, 3, and 5. Points on the diagonal yt+1 = yt are fixed points of the map. The curves for K ≥ 3 all have fixed points at a state distance larger than zero. The curve for K = 2 stays close to the diagonal for small state distances but does not cross it. Hence, for K = 2, state distances converge to zero under iterated application of the map.

large t? Derrida and Pomeau found that this is indeed the case for K ≤ 2, whereas for K ≥ 3 the trajectories diverge. To be more precise, one wants to know the probability P1(m, n) that the distance between the states at time t = 1 is m, given that the distance d(C1, C2) at time t = 0 was n. More generally, one wants to estimate the probability Pt(m, n) that the network states C1(t), C2(t) obtained at time t are at distance m, given that d(C1, C2) = n at time t = 0. It now becomes apparent why the annealed approximation is useful: in the annealed approximation, the state transition probabilities at different time steps are independent, which is not the case in the quenched model. For large N, one can introduce the continuous variable x = n/N. Derrida and Pomeau (1986) show that P1^annealed(m, n) for the annealed network has a peak around the value m = N y1, where y1 is given by

    y1 = (1 − (1 − x)^K) / 2 .    (6.1)

Similarly, the probability Pt^annealed(m, n) has a peak at m = N yt, with yt given by

    yt = (1 − (1 − y_{t−1})^K) / 2    (6.2)

for t > 1. The behavior of this iterated map can be visualized in the so-called Derrida plot; see fig. 6.1. The plot shows the state distance at time t + 1 as a


function of the state distance at time t. Points on the diagonal yt+1 = yt are fixed points of the map. For K ≤ 2, the fixed point y = 0 is the only fixed point of the map, and it is stable. In fact, for any starting value y1, we have yt → 0 as t → ∞ in the limit N → ∞. For K > 2, the fixed point y = 0 becomes unstable and a new stable fixed point y∗ appears. Therefore, the state distance no longer necessarily converges to zero. Hence there is a phase transition of the system at K = 2. The theoretical work of Derrida and Pomeau was important because previously there had been only empirical evidence for this phase transition. We conclude that there exists an interesting transition region from order to chaos in these dynamical systems. For simplified models, this region can be determined analytically. In the following section we will find evidence that such phase transitions are of great interest for the computational properties of dynamical systems.
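The fixed-point structure of the annealed map (6.2) can be checked numerically in a few lines. The following sketch (ours, not from the chapter) iterates the map from a small initial normalized distance:

```python
def derrida_map(y, K):
    """Annealed map for the normalized state distance:
    y_{t+1} = (1 - (1 - y_t)**K) / 2."""
    return (1.0 - (1.0 - y) ** K) / 2.0

final = {}
for K in (2, 3, 5):
    y = 0.01  # small initial normalized Hamming distance
    for _ in range(2000):
        y = derrida_map(y, K)
    final[K] = y
print({K: round(y, 4) for K, y in final.items()})
```

For K = 2 the distance decays toward zero (only slowly, since the slope of the map at y = 0 is exactly K/2 = 1), while for K = 3 and K = 5 the iteration settles at the nonzero stable fixed point y∗, reproducing the phase transition at K = 2.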

6.5 Computation at the Edge of Chaos in Cellular Automata

Evidence that systems exhibit superior computational properties near a phase transition came from the study of cellular automata. Cellular automata are quite similar to Boolean networks. The main differences are that connections between elements are local, and that an element may assume one out of k possible states at each time step (instead of merely two states, as in Boolean networks). The former difference implies that there is a notion of space in a cellular automaton. More precisely, a d-dimensional space is divided into cells (the elements of the network). The state of a cell at time t + 1 is a function only of its own state and the states of its immediate neighbors at time t. The latter difference is made explicit by defining a finite set Σ of cell states. The transition function Δ is a mapping from neighborhood states (including the cell itself) to the set of cell states: if the neighborhood is of size L, we have Δ : Σ^L → Σ. What do we mean by “computation” in the context of cellular automata? In one common meaning, the transition function is interpreted as the program, and the input is given by the initial state of the cellular automaton. The system then evolves for some specified number of time steps, or until some “goal pattern”—possibly a stable state—is reached. The final pattern is interpreted as the output of the automaton (Mitchell et al., 1993). In analogy to universal Turing machines, it has been shown that cellular automata are capable of universal computation (see, e.g., Codd, 1968; Smith, 1971; von Neumann, 1966). That is, there exist cellular automata which, given the algorithm to be applied as part of their initial configuration, can perform any computation that is computable by a Turing machine. In 1984, Wolfram conjectured that such powerful automata are located in a special dynamical regime.
Later, Langton identified this regime as lying at the phase transition between order and chaos (see below), i.e., in the regime which corresponds

Figure 6.2  Evolution of one-dimensional cellular automata. Each horizontal line represents one automaton state; successive time steps are shown as successive horizontal lines. Sites with value 1 are represented by black squares, sites with value 0 by white squares. One example each is given for an automaton of class 1 (left), class 4 (middle), and class 3 (right).

to random Boolean networks with K = 2. Wolfram presented a qualitative characterization of the behavior of one-dimensional cellular automata, where the individual automata differed in their transfer function.5 He found evidence that all one-dimensional cellular automata fall into four distinct classes (Wolfram, 1984). The dynamics for three of these classes are shown in fig. 6.2. Class 1 automata evolve to a homogeneous state, i.e., a state where all cells are in the same state; hence these systems evolve to a simple steady state. Class 2 automata evolve to a set of separated simple stable states or separated periodic structures of short period; these systems have short state cycles. Both of these classes operate in the ordered regime, in the sense that state cycles are short. Class 3 automata evolve to chaotic patterns. Class 4 automata have long transients and evolve “to complex localized structures” (Wolfram, 1984). Class 3 automata operate in the chaotic regime. By chaotic, Wolfram refers to the unpredictability of the exact automaton state after a few time steps: successor states look more or less random. He also speaks of nonperiodic patterns. Of course these patterns are periodic if the automaton is of finite size, but in analogy with the results presented above, one can say that the state cycles are very long. Transients are the states that occur before the dynamics reaches a stable long-lasting behavior; they appear at the beginning of the state evolution, and once the system is on a state cycle, it never revisits such transient states. The long transients of class 4 automata can be associated with large basins of attraction and high stability of state cycles. Wolfram conjectured that class 4 automata are capable of universal computations. In 1990, Langton systematically studied the space of cellular automata considered by Wolfram with respect to an order parameter λ (Langton, 1990). This


parameter λ determines a crucial property of the transfer function Δ: the fraction of entries in Δ which do not map to some prespecified quiescent state sq. Hence, for λ = 0, all local configurations map to sq, and the automaton moves to a homogeneous state after one time step for every initial condition. More generally, low λ values lead to ordered behavior, whereas rules with large λ tend to produce completely different behavior. Langton (1990) posed the following question: “Under what conditions will physical systems support the basic operations of information transmission, storage, and modification constituting the capacity to support computation?” When Langton went through different λ values in his simulations, he found that all of Wolfram's automaton classes appeared in this parameterization. Moreover, he found that the interesting class 4 automata are located at the phase transition between ordered and chaotic behavior, for λ values between about 0.45 and 0.5, i.e., values of intermediate heterogeneity. Information-theoretic analysis supported Wolfram's conjectures, indicating that the edge of chaos is the dominant region of computationally powerful systems. Further evidence for Wolfram's hypothesis came from Packard (1988). Packard used genetic algorithms to evolve one-dimensional cellular automata for a simple computational task. The goal was to obtain in this way cellular automata which behave as follows: the state of the automaton should converge to the all-one state (i.e., the state where every cell is in state 1) if the fraction of one-states in the initial configuration is larger than 0.5; if the fraction of one-states in the initial configuration is below 0.5, it should evolve to the all-zero state. Mutations were accomplished by changes in the transfer function (point mutations, which changed only a single entry in the rule table, and crossover, which merged two rule tables into a single one).
After applying a standard genetic algorithm procedure to an initial set of cellular automaton rules, he examined the rule tables of the genetically evolved automata. The majority of the evolved rule tables had λ values either around 0.23 or around 0.83. These are the two λ values at which the transition from order to chaos appears for cellular automata with two states per cell.6 “Thus, the population appears to evolve toward that part of the space of rules that marks the transition to chaos” (Packard, 1988). These results were later criticized (Mitchell et al., 1993). Mitchell and collaborators reexamined the ideas of Packard and performed similar simulations with a genetic algorithm. The results of these investigations differed from Packard's: the density of automata after evolution was symmetrically peaked around λ = 0.5, but much closer to 0.5 and definitely not in the transition region. They argued that the optimal λ value should strongly depend on the task. Specifically, for the task considered by Packard one would expect a λ value close to 0.5 for a well-performing rule, because the task is symmetric with respect to the exchange of ones and zeros. A rule with λ < 0.5 tends to decrease the number of ones in the state vector, because more entries in the rule table map the state to zero. This can lead to errors if the fraction of ones in the initial state is only slightly larger than 0.5. Indeed, a rule which performs very well on this task, the Gacs-


Kurdyumov-Levin (GKL) rule, has λ = 0.5. It was suggested that artifacts in the genetic algorithm could account for the differing results. We want to return here to the notion of computation. Wolfram and Langton were interested in universal computations. Although universality results for automata are mathematically interesting, they do not contribute much to the goal of understanding computations in biological neural systems. Biological organisms usually face computational tasks which are quite different from the off-line computations on discrete batch inputs for which Turing machines are designed. Packard was interested in automata which perform a specific kind of computation, with the transition function being the “program.” Mitchell et al. showed that there are complex tasks for which the best systems are not located at the edge of chaos. In Mitchell et al. (1993), a third meaning of computation in cellular automata—a kind of “intrinsic” computation—is mentioned: “Here, computation is not interpreted as the performance of a ‘useful’ transformation of the input to produce the output. Rather, it is measured in terms of generic, structural computational elements such as memory, information production, information transfer, logical operations, and so on. It is important to emphasize that the measurement of such intrinsic computational elements does not rely on a semantics of utility as do the preceding computational types” (Mitchell et al., 1993). It is worth noting that this “intrinsic” computation in dynamical systems can be exploited by a readout unit which maps system states to desired outputs. This is the basic idea of the liquid state machine and echo state networks, and it is the basis of the considerations in the following sections. To summarize, systems at the edge of chaos are believed to be computationally powerful. However, the types of computation considered so far differ considerably from computations in organisms.
In the following section, we will consider a model of computation better suited for our purposes.
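Langton's order parameter discussed above is straightforward to compute for any rule table. As a small illustration (function and variable names are ours), the sketch below computes λ for elementary rule 110, a rule commonly cited as class 4, taking state 0 as the quiescent state:

```python
from itertools import product

def langton_lambda(rule_table, states, neighborhood_size, quiescent):
    """Fraction of neighborhood configurations whose rule-table entry
    is not the quiescent state."""
    configs = list(product(states, repeat=neighborhood_size))
    non_quiescent = sum(1 for c in configs if rule_table[c] != quiescent)
    return non_quiescent / len(configs)

# Elementary CA rule 110: 2 states, neighborhood of size 3 (left, self, right).
rule110 = {c: (110 >> (4 * c[0] + 2 * c[1] + c[2])) & 1
           for c in product((0, 1), repeat=3)}
print(langton_lambda(rule110, (0, 1), 3, quiescent=0))  # -> 0.625
```

Note that for two-state automata the possible λ values are symmetric around 0.5 under exchange of the roles of the two states.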

6.6 The Edge of Chaos in Systems with Online Input Streams

All previously considered computations were off-line computations, where some initial state (the input) is transformed by the dynamics into a terminal state or state cycle (the output). However, computation in biological neural networks is quite different from computation in Turing machines or other traditional computational models. The input to an organism is a continuous stream of data, and the organism reacts in real time (i.e., within a given time interval) to information contained in this input. Hence, as opposed to batch processing, the input to a biological system is a time-varying signal which is mapped to a time-varying output signal. Such mappings are also called filters. In this section, we will have a look at recent work on real-time computations in threshold networks by Bertschinger and Natschläger (2004); see also Natschläger et al. (2005). Results of experiments with closely related hardware models are reported in Schuermann et al. (2005).
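As a toy illustration of a filter in this sense (our own example, not from the chapter), consider an online stream mapped in real time to a sliding average of its last three values:

```python
def filter_stream(u):
    """Map an input stream u(t) to an output stream y(t) in real time."""
    buf, out = [], []
    for v in u:
        buf = (buf + [v])[-3:]           # keep a bounded window of the past
        out.append(sum(buf) / len(buf))  # output depends only on recent inputs
    return out

y = filter_stream([1, 1, -1, -1, 1, 1])
```

The output at each time step depends only on a bounded window of past inputs, the hallmark of the fading-memory filters discussed below.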


Threshold networks are special cases of Boolean networks consisting of N elements (units) with states x_i ∈ {−1, 1}, i = 1, . . . , N. In networks with online input, the state of each element depends on the state of exactly K randomly chosen other units, and in addition on an external input signal u(·) (the online input). At each time step, u(t) assumes the value ū + 1 with probability r and the value ū − 1 with probability 1 − r. Here, ū is a constant input bias. The transfer function of the elements is not an arbitrary Boolean function but a randomly chosen threshold function of the form

$$x_i(t+1) = \Theta\Big(\sum_{j=1}^{N} w_{ij}\, x_j(t) + u(t+1)\Big), \qquad (6.3)$$

where w_ij ∈ ℝ is the weight of the connection from element j to element i, and Θ(h) = +1 if h ≥ 0 and Θ(h) = −1 otherwise. For each element, exactly K of its incoming weights are nonzero and chosen from a Gaussian distribution with zero mean and variance σ². Different dynamical regimes of such circuits are shown in fig. 6.3. The top row shows the online input, and below, typical activity patterns of networks with ordered, critical, and chaotic dynamics. The system parameters for each of these circuits are indicated in the phase plot below. The variance σ² of the nonzero weights was varied to achieve the different dynamics. The transition from the ordered to the chaotic regime is referred to as the critical line.

Bertschinger and Natschläger used the approach of Derrida to determine the dynamical regime of these systems. They analyzed the change in Hamming distance between two (initial) states and their successor states provided that the same input is applied in both situations. Using Derrida's annealed approximation, one can calculate the Hamming distance d(t + 1) given the Hamming distance d(t) of the states at time t. If arbitrarily small distances tend to increase, the network operates in the chaotic phase. If arbitrarily small distances tend to decrease, the network operates in the ordered phase. This can also be expressed by the stability of the fixed point d* = 0. In the ordered phase, this fixed point is the only fixed point and it is stable. In the chaotic phase, another fixed point appears and d* = 0 becomes unstable. The fixed point 0 is stable if the absolute value of the slope of the map at d(t) = 0,

$$\alpha = \left.\frac{\partial d(t+1)}{\partial d(t)}\right|_{d(t)=0},$$

is smaller than 1. Therefore, the transition from order to chaos (the critical line) is given by the line |α| = 1. This line can be characterized by the equation

$$r\, P_{BF}(\bar{u} + 1) + (1 - r)\, P_{BF}(\bar{u} - 1) = \frac{1}{K}, \qquad (6.4)$$

where the bit-flip probability P_BF(v) is the probability that a single changed state component in the K inputs to a unit that receives the current online input v leads to a change of the output of that unit. This result has a nice interpretation. Consider


What Makes a Dynamical System Computationally Powerful?

Figure 6.3 Threshold networks with online input streams in different dynamical regimes. The top row shows activity patterns for ordered (left), critical (middle), and chaotic behavior (right). Each vertical line represents the activity in one time step. Black (white) squares represent sites with value 1 (−1). Successive vertical lines represent successive circuit states. The input to the network is shown above the plots. The parameters σ² and ū of these networks are indicated in the phase plot below. Further parameters: number of input connections K = 4, number of elements N = 250.

a value of r = 1, i.e., the input to the network is constant. Consider two network states C_1, C_2 which differ in only one state component. This differing component is on average mapped to K elements (because each gate receives K inputs, hence there are altogether N · K connections). If the bit-flip probability in each of these units is larger than 1/K, then on average more than one of these units will differ in the successor states of C_1 and C_2. Hence, differences are amplified. If the bit-flip probability of each element is smaller than 1/K, the differences will die out on average.
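A minimal NumPy sketch of this divergence analysis (all parameter values, including the two choices of σ², are our own illustrative assumptions): two trajectories of the network (6.3) that start one bit apart are driven by the same online input, and their normalized Hamming distance after a number of steps is measured.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, u_bar, r = 250, 4, 0.0, 0.5

def make_network(sigma2):
    # Exactly K nonzero incoming weights per unit, Gaussian with variance sigma2.
    W = np.zeros((N, N))
    for i in range(N):
        idx = rng.choice(N, size=K, replace=False)
        W[i, idx] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    return W

def hamming_after(W, steps=30):
    # Two trajectories that start one bit apart, driven by the SAME input.
    x1 = rng.choice([-1.0, 1.0], size=N)
    x2 = x1.copy()
    x2[0] *= -1.0
    for _ in range(steps):
        u = u_bar + 1.0 if rng.random() < r else u_bar - 1.0
        x1 = np.where(W @ x1 + u >= 0, 1.0, -1.0)   # update rule (6.3)
        x2 = np.where(W @ x2 + u >= 0, 1.0, -1.0)
    return float(np.mean(x1 != x2))                  # normalized Hamming distance

d_weak = hamming_after(make_network(sigma2=0.05))    # input-dominated regime
d_strong = hamming_after(make_network(sigma2=10.0))  # weight-dominated regime
```

With weak weights the shared input tends to pull both trajectories together (ordered behavior), while strong weights let the flipped bit spread (chaotic behavior); a single run is of course noisy, which is why the analysis above works with averaged (annealed) quantities.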

6.7 Real-Time Computation in Dynamical Systems

In the previous section we were interested in the dynamical properties of systems with online input. The work we discussed there was influenced by recent ideas concerning computation in neural circuits that we will sketch in this section. The idea to use the rich dynamics of neural systems which can be observed in cortical circuits, rather than to restrict them, resulted in the liquid state machine


model by Maass et al. (2002) and the echo state network by Jaeger (2002).7 They assume time series as inputs and outputs of the system. A recurrent network is used to hold nonlinearly transformed information about the past input stream in the state of the network. It is followed by a memoryless readout unit which simply looks at the current state of the circuit. The readout can then learn to map the current state of the system onto some target output. Superior performance of echo state networks for various engineering applications is suggested by the results of Jaeger and Haas (2004).

The requirement that the network operates in the ordered phase is important in these models, although it is usually described with a different terminology. The ordered phase can be described using the notion of fading memory (Boyd and Chua, 1985). Time-invariant fading memory filters are exactly those filters which can be represented by Volterra series. Informally speaking, a network has fading memory if its state at time t depends (up to some finite precision) only on the values (up to some finite precision) of its input from some finite time window [t − T, t] into the past (Maass et al., 2002). This is essentially equivalent to the requirement that if there are no longer any differences in the online inputs, then the state differences converge to 0, which is called the echo state property in Jaeger (2002).

Besides the fading memory property, another property of the network is important for computations on time series: the pairwise separation property (Maass et al., 2002). Roughly speaking, a network has the pairwise separation property if for any two input time series which differed in the past, the network assumes different states at subsequent time points. Chaotic networks have such a separation property, but they do not have fading memory, since differences in the initial state are amplified. On the other hand, very ordered systems have fading memory but provide weak separation.
Hence, the separation property and the fading memory property are antagonistic. Ideally, one would like to have high separation for salient differences in the input stream while still keeping the fading memory property (especially for variations in the input stream that do not contribute salient information). It is therefore of great interest to analyze these properties in models for neural circuits. A first step in this direction was made by Bertschinger and Natschläger (2004) in the context of threshold circuits.

Similar to section 6.6, one can analyze the evolution of the state separation resulting from two input streams u1 and u2 which differ at time t with some probability. The authors defined the network-mediated separation (NM-separation for short) of a network. Informally speaking, the NM-separation is roughly the amount of state distance in a network which results from differences in the input stream, minus the amount of state difference resulting from different initial states. Hence, the NM-separation has a small value in the ordered regime, where both terms are small, but also in the chaotic regime, where both terms are large. Indeed, it was shown that the NM-separation peaks at the critical line, as shown in fig. 6.4a. Hence, Bertschinger and Natschläger (2004) offer a new interpretation for the critical line and provide a more direct link between the edge of chaos and computational power.
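The two competing contributions can be sketched numerically as follows; this is a crude stand-in for the published NM-separation (which averages over input statistics), and all parameter values are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
N, K, steps = 250, 4, 20

W = np.zeros((N, N))
for i in range(N):
    idx = rng.choice(N, size=K, replace=False)
    W[i, idx] = rng.normal(0.0, 1.0, size=K)

def dist_after(x1, x2, u1, u2):
    # Run both trajectories and return their final normalized Hamming distance.
    for t in range(steps):
        x1 = np.where(W @ x1 + u1[t] >= 0, 1.0, -1.0)
        x2 = np.where(W @ x2 + u2[t] >= 0, 1.0, -1.0)
    return float(np.mean(x1 != x2))

x0 = rng.choice([-1.0, 1.0], size=N)
x0_flip = x0.copy()
x0_flip[0] *= -1.0                        # initial states differing in one bit
u = rng.choice([-1.0, 1.0], size=steps)
u_flip = u.copy()
u_flip[0] *= -1.0                         # input streams differing at one time step

sep_input = dist_after(x0, x0, u, u_flip)  # separation driven by the input difference
sep_state = dist_after(x0, x0_flip, u, u)  # separation driven by the state difference
nm_sep = sep_input - sep_state             # crude stand-in for the NM-separation
```

Near the ordered regime both terms are small, deep in the chaotic regime both are large, so only around the critical line does the input-driven term dominate.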


Figure 6.4 The network-mediated separation and computational performance for a 3-bit parity task with different settings of the parameters σ² and ū. (a) The NM-separation peaks at the critical line. (b) High performance is achieved near the critical line. The performance is measured in terms of the memory capacity MC (Jaeger, 2002). The memory capacity is defined as the mutual information MI between the network output and the target function summed over all delays τ > 0 on a test set. More formally, $MC = \sum_{\tau=0}^{\infty} MI(v_\tau, y_\tau)$, where v_τ(·) denotes the network output and y_τ(t) = PARITY(u(t − τ), u(t − τ − 1), u(t − τ − 2)) is the target output.

Since the separation property is important for the computational properties of the network, one would expect that the computational performance peaks near the critical line. This was confirmed with simulations in which the computational task was to compute the delayed 3-bit parity8 of the input signal. The readout neuron was implemented by a simple linear classifier C(x(t)) = Θ(w · x(t) + w0) which was trained with linear regression. Note that the parity task is quite complex, since it partitions the set of all inputs into two classes which are not linearly separable (and can therefore not be represented by the linear readout alone), and it requires memory. Figure 6.4b shows that the highest performance is achieved for parameter values close to the critical line, although it is not clear why the performance drops for increasing values of ū. In contrast to preceding work (Langton, 1990; Packard, 1988), the networks used were not optimized for a specific task. Only the linear readout was trained to extract the specific information from the state of the system. This is important since it decouples the dynamics of the network from a specific task.
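The setup can be sketched as follows: a random threshold network is driven by a ±1 input stream, its states are recorded, and a linear readout is fit by least squares to the delayed 3-bit parity (encoded here as a sign product). The network parameters and the single weight variance are our own illustrative choices, so the achieved accuracy depends on where the sampled circuit happens to lie relative to the critical line.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, T, delay = 250, 4, 4000, 2

# Random threshold network (weight variance 1 is an arbitrary illustrative choice).
W = np.zeros((N, N))
for i in range(N):
    idx = rng.choice(N, size=K, replace=False)
    W[i, idx] = rng.normal(0.0, 1.0, size=K)

u = rng.choice([-1.0, 1.0], size=T)            # online input stream (u_bar = 0)
X = np.zeros((T, N))
x = rng.choice([-1.0, 1.0], size=N)
for t in range(T - 1):
    x = np.where(W @ x + u[t + 1] >= 0, 1.0, -1.0)
    X[t + 1] = x

# Delayed 3-bit parity, encoded via the sign of a product of +-1 inputs.
y = np.array([u[t - delay] * u[t - delay - 1] * u[t - delay - 2] > 0
              for t in range(T)])

Xb = np.hstack([X, np.ones((T, 1))])           # constant component for the bias w0
train, test = slice(delay + 3, 3000), slice(3000, T)
w, *_ = np.linalg.lstsq(Xb[train], 2.0 * y[train] - 1.0, rcond=None)
pred = (Xb[test] @ w) >= 0                     # readout C(x) = Theta(w.x + w0)
acc = float(np.mean(pred == y[test]))
```

Only the readout weights are trained; the recurrent weights W stay fixed, mirroring the decoupling of network dynamics from the task described above.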

6.8 Self-Organized Criticality

Are there systems in nature with dynamics located at the edge of chaos? Since the edge of chaos is a small boundary region in the space of possible dynamics, only a vanishingly small fraction of systems should operate in this dynamical regime. However, it was argued that such "critical" systems are abundant in nature (see,


e.g., Bak et al., 1988). How is this possible if critical dynamics occur only accidentally in nature? Bak and collaborators argue that a class of dissipative coupled systems naturally evolves toward critical dynamics (Bak et al., 1988). This phenomenon was termed self-organized criticality (SOC), and it was demonstrated with a model of a sand pile. Imagine building up a sand pile by randomly adding sand to the pile, one grain at a time. As sand is added, the slope will increase. Eventually, the slope will reach a critical value. Whenever the local slope of the pile is too steep, sand will slide off, thereby reducing the slope locally. On the other hand, if one starts with a very steep pile, it will collapse and reach the critical slope from the other direction.

In neural systems, the topology of the network and the synaptic weights strongly influence the dynamics. Since the number of genetically determined connections between neurons is limited, self-organizing processes during brain development as well as learning processes are assumed to play a key role in regulating the dynamics of biological neural networks (Bornholdt and Röhl, 2003). Although the dynamics is a global property of the network, biologically plausible learning rules try to estimate the global dynamics from information available at the local synaptic level, and they change only local parameters. Several SOC rules have been suggested (Bornholdt and Röhl, 2000, 2003; Christensen et al., 1998; Natschläger et al., 2005). In Bornholdt and Röhl (2003), the degree of connectivity was regulated in a locally connected network (i.e., only neighboring neurons are connected) with stochastic state update dynamics. A local rewiring rule was used which is related to Hebbian learning. The main idea of this rule is that the average correlation between the activities of two neurons contains information about the global dynamics. This rule relies only on information available at the local synaptic level.
Self-organized criticality in systems with online input streams (as discussed in section 6.6) was considered in Natschläger et al. (2005). According to section 6.6, the dynamics of a threshold network is at the critical line if the bit-flip probability P_BF (averaged over the external and internal input statistics) is equal to 1/K, where K is the number of inputs to a unit. The idea is to estimate the bit-flip probability of a unit by the mean distance of the internal activation of that unit from the firing threshold. This distance is called the margin. Intuitively, a node with an activation much higher or lower than its firing threshold is rather unlikely to change its output if a single bit in its inputs is flipped. Each node i then applies synaptic scaling to its weights w_ij in order to adjust itself toward the critical line:

$$w_{ij}(t+1) = \begin{cases} \dfrac{1}{1+\nu}\, w_{ij}(t) & \text{if } P_{BF}^{est_i}(t) > \dfrac{1}{K}, \\[6pt] (1+\nu)\, w_{ij}(t) & \text{if } P_{BF}^{est_i}(t) < \dfrac{1}{K}, \end{cases} \qquad (6.5)$$

where 0 < ν ≪ 1 is the learning rate and P_{BF}^{est_i}(t) is an estimate of the bit-flip probability P_{BF}^{i} of unit i. It was shown by simulations that this rule keeps the dynamics in the critical regime, even if the input statistics change. The computational capabilities of randomly chosen circuits with this synaptic scaling rule acting online during computation were tested in a setup similar to that discussed in section 6.7.
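A sketch of rule (6.5) in NumPy. The per-unit bit-flip estimator below (comparing each margin |h_i| with twice the magnitudes of the unit's nonzero weights) is our own crude simplification of the margin-based estimator, and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, nu, steps = 250, 4, 0.05, 200

W = np.zeros((N, N))
mask = np.zeros((N, N), dtype=bool)            # positions of the K nonzero weights
for i in range(N):
    idx = rng.choice(N, size=K, replace=False)
    mask[i, idx] = True
    W[i, idx] = rng.normal(0.0, 3.0, size=K)   # start away from the critical line

x = rng.choice([-1.0, 1.0], size=N)
for t in range(steps):
    u = rng.choice([-1.0, 1.0])
    h = W @ x + u                              # internal activation of each unit
    x = np.where(h >= 0, 1.0, -1.0)
    # Flipping input j shifts h_i by 2|w_ij|; count how often that could cross 0.
    p_est = np.mean(np.abs(h)[:, None] < 2.0 * np.abs(W), axis=1, where=mask)
    # Rule (6.5): shrink too-sensitive units, grow too-insensitive ones
    # (ties, p_est == 1/K, are treated as growth here for simplicity).
    W = np.where((p_est > 1.0 / K)[:, None], W / (1.0 + nu), W * (1.0 + nu))
```

Because the rule only rescales each unit's existing weights, the sparse connectivity pattern (exactly K nonzero inputs per unit) is preserved throughout.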


The performance of these networks was as high as for circuits whose parameters were chosen a priori in the critical regime, and their dynamics stayed in this regime. This shows that systems can perform specific computations while still being able to react to changing input statistics in a flexible way.

6.9 Toward the Analysis of Biological Neural Systems

Do cortical microcircuits operate at the edge of chaos? If biology makes extensive use of the rich internal dynamics of cortical circuits, then the previous considerations would suggest this idea. However, the neural elements in the brain are quite different from the elements discussed so far. Most important, biological neurons communicate with spikes, discrete events in continuous time. In this section, we will investigate the dynamics of spiking circuits and ask: In what dynamical regimes are neural microcircuits computationally powerful? We propose in this section a conceptual framework and new quantitative measures for the investigation of this question (see also Maass et al., 2005). In order to make this approach feasible, in spite of numerous unknowns regarding synaptic plasticity and the distribution of electrical and biochemical signals impinging on a cortical microcircuit, we make in the present first step of this approach the following simplifying assumptions:

1. Particular neurons ("readout neurons") learn via synaptic plasticity to extract specific information encoded in the spiking activity of neurons in the circuit.

2. We assume that the cortical microcircuit itself is highly recurrent, but that the impact of feedback that a readout neuron might send back into this circuit can be neglected.9

3. We assume that synaptic plasticity of readout neurons enables them to learn arbitrary linear transformations. More precisely, we assume that the input to such readout neurons can be approximated by a term $\sum_{i=1}^{n-1} w_i x_i(t)$, where n − 1 is the number of presynaptic neurons, x_i(t) results from the output spike train of the ith presynaptic neuron by filtering it according to the low-pass filtering property of the membrane of the readout neuron,10 and w_i is the efficacy of the synaptic connection.
Thus w_i x_i(t) models the time course of the contribution of previous spikes from the ith presynaptic neuron to the membrane potential at the soma of this readout neuron. We will refer to the vector x(t) as the "circuit state at time t" (although it is really only that part of the circuit state which is directly observable by readout neurons). All microcircuit models that we consider are based on biological data for generic cortical microcircuits (as described in section 6.9.1), but have different settings of their parameters.
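The low-pass filtering step can be sketched as a leaky integrator; the 30 ms time constant and the unit kick per spike are our own illustrative assumptions, since the text only specifies "the low-pass filtering property of the membrane":

```python
import numpy as np

rng = np.random.default_rng(4)
tau, dt = 0.03, 0.001                  # assumed 30 ms time constant, 1 ms bins
spikes = rng.random(500) < 20 * dt     # ~20 Hz Poisson spike train over 500 ms

x = np.zeros(len(spikes))              # x_i(t): filtered trace seen by the readout
for t in range(1, len(spikes)):
    # Exponential decay of past contributions plus a unit kick for each spike.
    x[t] = x[t - 1] * np.exp(-dt / tau) + spikes[t]
```

Each spike leaves an exponentially decaying trace, so x_i(t) summarizes the recent firing history of presynaptic neuron i in a single real number.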

6.9.1 Models for Generic Cortical Microcircuits

Our empirical studies were performed on a large variety of models for generic cortical microcircuits (we refer to Maass et al., 2004, for more detailed definitions and explanations). All circuit models consisted of leaky integrate-and-fire neurons11 and biologically quite realistic models for dynamic synapses.12 Neurons (20% of which were randomly chosen to be inhibitory) were located on the grid points of a 3D grid of dimensions 6 × 6 × 15 with edges of unit length. The probability of a synaptic connection from neuron a to neuron b was proportional to exp(−D²(a, b)/λ²), where D(a, b) is the Euclidean distance between a and b, and λ is a spatial connectivity constant (not to be confused with the λ parameter used by Langton). Synaptic efficacies w were chosen randomly from distributions that reflect biological data (as in Maass et al., 2002), with a common scaling factor Wscale.


Figure 6.5 Performance of different types of neural microcircuit models for classification of spike patterns. (a) In the top row are two examples of the 80 spike patterns that were used (each consisting of 4 Poisson spike trains at 20 Hz over 200 ms), and in the bottom row are examples of noisy variations (Gaussian jitter with SD 10 ms) of these spike patterns which were used as circuit inputs. (b) Fraction of examples (out of 200 test examples) that were correctly classified by a linear readout (trained by linear regression with 500 training examples). Results are shown for 90 different types of neural microcircuits C with λ varying on the x-axis and Wscale on the y-axis (20 randomly drawn circuits and 20 target classification functions randomly drawn from the set of 2^80 possible classification functions were tested for each of the 90 different circuit types, and the resulting correctness rates were averaged). Circles mark three specific choices of (λ, Wscale) pairs for comparison with other figures; see fig. 6.6. The standard deviation of the result is shown in the inset on the upper right.

Linear readouts from circuits with n − 1 neurons were assumed to compute a weighted sum $\sum_{i=1}^{n-1} w_i x_i(t) + w_0$ (see section 6.9). In order to simplify notation we assume that the vector x(t) contains an additional constant component x_0(t) = 1, so that one can write w · x(t) instead of $\sum_{i=1}^{n-1} w_i x_i(t) + w_0$. In the case of classification tasks we assume that the readout outputs 1 if w · x(t) ≥ 0, and 0 otherwise.
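In code, this notational convention amounts to appending one constant component to the state vector (the values below are our own toy stand-ins for a filtered circuit state):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 10
x = rng.normal(size=n - 1)             # stand-in for the filtered circuit state
w = rng.normal(size=n)                 # n-1 readout weights plus the bias weight w0

xb = np.concatenate([x, [1.0]])        # append the constant component x0 = 1
out = 1 if w @ xb >= 0 else 0          # classification readout: 1 iff w.x(t) >= 0
```

Folding the bias w_0 into the weight vector lets linear regression fit weights and bias in a single least-squares solve.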


In order to investigate the influence of synaptic connectivity on computational performance, neural microcircuits were drawn from this distribution for 10 different values of λ (which scales the number and average distance of synaptically connected neurons) and 9 different values of Wscale (which scales the efficacy of all synaptic connections). Twenty microcircuit models C were drawn for each of these 90 different assignments of values to λ and Wscale. For each circuit a linear readout was trained to perform one (randomly chosen) out of 2^80 possible classification tasks on noisy variations of 80 fixed spike patterns as circuit inputs u. See fig. 6.5 for two examples of such spike patterns. The target performance of any such circuit was to output at time t = 200 ms the class (0 or 1) of the spike pattern from which the preceding circuit input had been generated (for some arbitrary partition of the 80 fixed spike patterns into two classes). Each spike pattern u consisted of four Poisson spike trains over 200 ms. Performance results are shown in fig. 6.5b for 90 different types of neural microcircuit models.

6.9.2 Locating the Edge of Chaos in Neural Microcircuit Models

It turns out that the previously considered characterizations of the edge of chaos are not too successful in identifying those parameter values in the map of fig. 6.5b that yield circuits with large computational power (Maass et al., 2005). The reason is that large initial state differences (as they are typically caused by different spike input patterns) tend to yield, for most values of the circuit parameters, nonzero state differences not only while the online spike inputs are different, but also long afterward when the online inputs agree during subsequent seconds (even if the random internal noise is identical in both trials).

But if one applies the definition of the edge of chaos via Lyapunov exponents (see Kantz and Schreiber, 1997), the resulting edge of chaos lies for the previously introduced type of computations (classification of noisy spike templates by a trained linear readout) in the region of the best computational performance (see the map in fig. 6.5b, which is repeated for easier comparison in fig. 6.6d). For this definition one looks for the exponent μ ∈ ℝ that provides, through the formula δ_ΔT ≈ δ_0 · e^{μΔT}, the best estimate of the state separation δ_ΔT at time ΔT after the computation was started in two trials with an initial state difference δ_0. We generalize this analysis to the case with online input by choosing exactly the same online input (and the same random noise) during the intervening time interval of length ΔT, and by averaging the resulting state differences δ_ΔT over many random choices of such online inputs (and internal noise). As in the classical case with off-line input it turns out to be essential to apply this estimate for δ_0 → 0, since δ_ΔT tends to saturate for each fixed value of δ_0. This can be seen in fig. 6.6a, which shows results of this experiment for a δ_0 that results from moving a single spike that occurs in the online input at time t = 1 s by 0.5 ms. This experiment was repeated for three

Figure 6.6 Analysis of small input differences for different types of neural microcircuit models as specified in section 6.9.1. Each circuit C was tested for two arrays u and v of 4 input spike trains at 20 Hz over 10 s that differed only in the timing of a single spike at time t = 1 s. (a) A spike at time t = 1 s was delayed by 0.5 ms. Temporal evolution of Euclidean differences between resulting circuit states xu(t) and xv(t) for 3 different values of λ, Wscale according to the three points marked in panel c. For each parameter pair, the average state difference of 40 randomly drawn circuits is plotted. (b) Lyapunov exponents μ along a straight line between the points marked in panel c, with different delays of the delayed spike (denoted on the right of each line). The exponents were determined for the average state difference of 40 randomly drawn circuits. (c) Lyapunov exponents μ for 90 different types of neural microcircuits C with λ varying on the x-axis and Wscale on the y-axis (the exponents were determined for the average state difference of 20 randomly drawn circuits for each parameter pair). A spike in u at time t = 1 s was delayed by 0.5 ms. The contour lines indicate where μ crosses the values −1, 0, and 1. (d) Computational performance of these circuits (same as fig. 6.5b), shown for comparison with panel c.

different circuits with parameters chosen from the 3 locations marked on the map in fig. 6.6c. By determining the best-fitting μ for ΔT = 1.5 s for three different values of δ_0 (resulting from moving a spike at time t = 1 s by 0.5, 1, or 2 ms) one gets the dependence of this Lyapunov exponent on the circuit parameter λ shown in fig. 6.6b (for values of λ and Wscale on a straight line between the points marked in the map of fig. 6.6c). The middle curve in fig. 6.6c shows for which values of λ and Wscale the Lyapunov exponent is estimated to have the value 0. By comparing it with those regions on this parameter map where the circuits have the largest computational power (for the classification of noisy spike patterns, see fig. 6.6d), one sees that this line runs through those regions which yield the largest computational power for these computations. We refer to Mayor and Gerstner (2005) for other recent work


on studies of the relationship between the edge of chaos and the computational power of spiking neural circuit models.

Although this estimated edge of chaos coincides quite well with points of best computational performance, it remains an unsatisfactory tool for predicting parameter regions with large computational power, for three reasons:

1. Since the edge of chaos is a lower-dimensional manifold in a parameter map (in this case a curve in a 2D map), it cannot predict the (full-dimensional) regions of a parameter map with high computational performance (e.g., the regions with light shading in fig. 6.5b).

2. The edge of chaos does not provide intrinsic reasons why points of the parameter map yield small or large computational power.

3. It turns out that in some parameter maps different regions provide circuits with large computational power for different classes of computational tasks (as shown in Maass et al., 2005, for computations on spike patterns and for computations with firing rates). But the edge of chaos can at best single out peaks for one of these regions. Hence it cannot possibly be used as a universal predictor of maximal computational power for all types of computational tasks.

These three deficiencies suggest that one has to think about different strategies to approach the central question of this chapter. The strategy we will pursue in the following is based on the assumption that the computational function of cortical microcircuits is not fully genetically encoded, but rather emerges through various forms of plasticity ("learning") in response to the actual distribution of signals that the neural microcircuit receives from its environment. From this perspective the question about the computational function of cortical microcircuits C turns into the following questions:

What functions (i.e., maps from circuit inputs to circuit outputs) can the circuit C learn to compute?

How well can the circuit C generalize a specific learned computational function to new inputs?

In the following, we propose quantitative criteria based on rigorous mathematical principles for evaluating a neural microcircuit C with regard to these two questions. We will compare in section 6.9.5 the predictions of these quantitative measures with the actual computational performance achieved by neural microcircuit models as discussed in section 6.9.1.

6.9.3 A Measure for the Kernel-Quality

One expects from a powerful computational system that signiﬁcantly diﬀerent input streams cause signiﬁcantly diﬀerent internal states and hence may lead to diﬀerent outputs. Most real-world computational tasks require that the circuit give a desired output not just for two, but for a fairly large number m of signiﬁcantly diﬀerent


inputs. One could of course test whether a circuit C can separate each of the $\binom{m}{2}$ pairs of such inputs. But even if the circuit can do this, we do not know whether a neural readout from such a circuit would be able to produce given target outputs for these m inputs. Therefore we propose here the linear separation property as a more suitable quantitative measure for evaluating the computational power of a neural microcircuit (or more precisely, the kernel quality of a circuit; see below).

To evaluate the linear separation property of a circuit C for m different inputs u_1, . . . , u_m (which are in the following always functions of time, i.e., input streams such as, for example, multiple spike trains) we compute the rank of the n × m matrix M whose columns are the circuit states x_{u_i}(t_0) that result at some fixed time t_0 for the preceding input stream u_i. If this matrix has rank m, then it is guaranteed that any given assignment of target outputs y_i ∈ ℝ at time t_0 for the inputs u_i can be implemented by this circuit C (in combination with a linear readout). In particular, each of the 2^m possible binary classifications of these m inputs can then be carried out by a linear readout from this fixed circuit C. Obviously such insight is much more informative than a demonstration that some particular classification task can be carried out by such a circuit C. If the rank of this matrix M has a value r < m, then this value r can still be viewed as a measure for the computational power of this circuit C, since r is the number of "degrees of freedom" that a linear readout has in assigning target outputs y_i to these inputs u_i (in a way which can be made mathematically precise with concepts of linear algebra).
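The rank measure itself is straightforward to compute; in this sketch, random vectors stand in for the circuit states x_{u_i}(t_0) (our own toy data, not an actual circuit simulation):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 50, 30                          # n state components, m input streams

M = rng.normal(size=(n, m))            # columns: circuit states x_{u_i}(t0)
r = int(np.linalg.matrix_rank(M))
# r == m would guarantee that every assignment of targets y_i (and hence every
# binary classification of the m inputs) is realizable by a linear readout.
```

For states produced by a real circuit the interesting case is r < m, where r counts the degrees of freedom available to the linear readout.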
Note that this rank measure for the linear separation property of a circuit C may be viewed as an empirical measure for its kernel quality, i.e., for the complexity and diversity of the nonlinear operations carried out by C on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (see Vapnik, 1998).

6.9.4 A Measure for the Generalization-Capability

Obviously the preceding measure addresses only one component of the computational performance of a neural circuit C. Another component is its capability to generalize a learned computational function to new inputs. Mathematical criteria for generalization capability are derived by Vapnik (1998) (see ch. 4 in Cherkassky and Mulier, 1998, for a compact account of the results relevant for our arguments).

According to this mathematical theory, one can quantify the generalization capability of any learning device in terms of the VC-dimension of the class H of hypotheses that are potentially used by that learning device.13 More precisely: if VC-dimension(H) is substantially smaller than the size of the training set S_train, one can prove that this learning device generalizes well, in the sense that the hypothesis (or input-output map) produced by this learning device is likely to have for new examples an error rate which is not much higher than its error rate on S_train, provided that the new examples are drawn from the same distribution as the training examples (see eq. 4.22 in Cherkassky and Mulier, 1998).

We apply this mathematical framework to the class H_C of all maps from a set


What Makes a Dynamical System Computationally Powerful?

Suniv of inputs u into {0, 1} that can be implemented by a circuit C. More precisely: HC consists of all maps from Suniv into {0, 1} that could possibly be implemented by a linear readout from circuit C with fixed internal parameters (weights etc.) but arbitrary weights w ∈ Rn of the readout (which classifies the circuit input u as belonging to class 1 if w · xu(t0) ≥ 0, and to class 0 if w · xu(t0) < 0). Whereas it is very difficult to achieve tight theoretical bounds for the VC-dimension of even much simpler neural circuits (see Bartlett and Maass, 2003), one can efficiently estimate the VC-dimension of the class HC that arises in our context for some finite ensemble Suniv of inputs (that contains all examples used for training or testing) by using the following mathematical result (which can be proved with the help of Radon's theorem):

Theorem 6.1 Let r be the rank of the n × s matrix consisting of the s vectors xu(t0) for all inputs u in Suniv (we assume that Suniv is finite and contains s inputs). Then r ≤ VC-dimension(HC) ≤ r + 1.

Proof Idea. Fix some inputs u1, . . . , ur in Suniv so that the resulting r circuit states xui(t0) are linearly independent. The first inequality is obvious, since this set of r linearly independent vectors can be shattered by linear readouts from the circuit C. To prove the second inequality, one assumes for a contradiction that there exists a set v1, . . . , vr+2 of r + 2 inputs in Suniv so that the corresponding set of r + 2 circuit states xvi(t0) can be shattered by linear readouts. This set M of r + 2 vectors is contained in the r-dimensional space spanned by the linearly independent vectors xu1(t0), . . . , xur(t0). Therefore Radon's theorem implies that M can be partitioned into disjoint subsets M1, M2 whose convex hulls intersect. Since these sets M1, M2 cannot be separated by a hyperplane, it is clear that no linear readout exists that assigns value 1 to points in M1 and value 0 to points in M2.
Hence M = M1 ∪ M2 is not shattered by linear readouts, a contradiction to our assumption.

We propose to use the rank r defined in theorem 6.1 as an estimate of VC-dimension(HC), and hence as a measure that informs us about the generalization capability of a neural microcircuit C. It is assumed here that the set Suniv contains many noisy variations of the same input signal, since otherwise learning with a randomly drawn training set Strain ⊆ Suniv has no chance of generalizing to new noisy variations. Note that each family of computational tasks induces a particular notion of which aspects of the input are viewed as noise, and which input features are viewed as signals that carry information relevant for the target output of at least one of these computational tasks. For example, for computations on spike patterns some small jitter in the spike timing is viewed as noise. For computations on firing rates, even the sequence of interspike intervals and the temporal relations between spikes that arrive from different input sources are viewed as noise, as long as these input spike trains represent the same firing rates. An example of the former kind of computational task was discussed in section 6.9.1. This task was to output at time t = 200 ms the class (0 or 1) of the spike pattern

Toward the Analysis of Biological Neural Systems

Figure 6.7 Measuring the generalization capability of neural microcircuit models. (a) Test error minus train error (error was measured as the fraction of examples that were misclassified) in the spike pattern classification task discussed in section 6.9.1 for 90 different types of neural microcircuits (as in fig. 6.5b). The standard deviation is shown in the inset on the upper right. (b) Generalization capability for spike patterns: estimated VC-dimension of HC (for a set Suniv of inputs u consisting of 500 jittered versions of 4 spike patterns), for 90 different circuit types (average over 20 circuits; for each circuit, the average over 5 different sets of spike patterns was used). The standard deviation is shown in the inset on the upper right. See section 6.9.5 for details.

from which the preceding circuit input had been generated (for some arbitrary partition of the 80 fixed spike patterns into two classes; see section 6.9.1). For a poorly generalizing network, the difference between train and test error is large. One would expect this difference to become large as the network dynamics become more and more chaotic. This is indeed the case; see fig. 6.7a. The transition is predicted quite well by the estimated VC-dimension of HC; see fig. 6.7b.

6.9.5 Evaluating the Influence of Synaptic Connectivity on Computational Performance

We now test the predictive quality of the two proposed measures for the computational power of a microcircuit on spike patterns. One should keep in mind that the proposed measures do not attempt to test the computational capability of a circuit for one particular computational task, but rather for any distribution on Suniv and for a very large (in general, infinitely large) family of computational tasks that have in common only a particular bias regarding which aspects of the incoming spike trains may carry information that is relevant for the target output of computations, and which aspects should be viewed as noise. Figure 6.8a explains why the lower left part of the parameter map in fig. 6.5b is less suitable for any such computation, since there the kernel quality of the circuits is too low.14 Figure 6.8b explains why the upper right part of the parameter map in fig. 6.5b is less suitable, since a higher VC-dimension (for a training set of fixed size) entails poorer generalization capability. We are not aware of a theoretically founded way of combining both measures into a single value that predicts overall computational

Figure 6.8 Values of the proposed measures for computations on spike patterns. (a) Kernel quality for spike patterns of 90 different circuit types (average over 20 circuits, mean SD = 13). (b) Generalization capability for spike patterns: estimated VC-dimension of HC (for a set Suniv of inputs u consisting of 500 jittered versions of 4 spike patterns), for 90 different circuit types (same as fig. 6.7b). (c) Difference of both measures (the standard deviation is shown in the inset on the upper right). This should be compared with actual computational performance plotted in fig. 6.5b.

performance. But if one simply takes the difference of both measures (after scaling each linearly into a common range [0, 1]), then the resulting number (see fig. 6.8c) predicts quite well which types of neural microcircuit models perform well on the particular computational tasks considered in fig. 6.5b.15 Results of further tests of the predictive power of these measures are reported in Maass et al. (2005), where they were applied to a completely different parameter map and to diverse classes of computational tasks.
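The combination rule just described (scale each measure linearly into [0, 1], then subtract) is simple enough to state in code. The numbers below are illustrative placeholders for a small grid of circuit types, not the data behind fig. 6.8:

```python
import numpy as np

def scale01(a):
    """Linearly rescale an array into the common range [0, 1]."""
    a = np.asarray(a, dtype=float)
    return (a - a.min()) / (a.max() - a.min())

# Hypothetical measured values, one entry per circuit type:
# kernel-quality ranks and estimated VC-dimensions.
kernel_rank = np.array([120, 310, 450, 480, 495])
vc_estimate = np.array([130, 300, 420, 470, 499])

# Predicted performance: kernel quality minus generalization penalty.
prediction = scale01(kernel_rank) - scale01(vc_estimate)
best_type = int(np.argmax(prediction))
```

Circuit types with high kernel quality but comparatively low estimated VC-dimension score highest, matching the intuition that the best circuits sit between the ordered and chaotic regimes.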

6.10 Conclusions

The need to understand computational properties of complex dynamical systems is becoming more urgent. New experimental methods provide substantial insight into the inherent dynamics of the computationally most powerful classes of dynamical systems that are known: neural systems and gene regulation networks of biological organisms. More recent experimental data show that simplistic models for computations in such systems are not adequate, and that new concepts and methods have to be developed in order to understand their computational function. This short review has shown that several old ideas regarding computations in dynamical systems receive new relevance in this context, once they are transposed into a more realistic conceptual framework that also allows us to analyze online computations on continuous input streams. Another new ingredient is the investigation of the temporal evolution of information in a dynamical system from the perspective of models for the (biological) user of such information, i.e., from the perspective of neurons that receive inputs from several thousand presynaptic neurons in a neural circuit, and from the perspective of gene regulation mechanisms that involve thousands of transcription factors. Empirical evidence from the area of machine learning


supports the hypothesis that readouts of this type, which are able to sample not just two or three, but thousands of coordinates of the state vector of a dynamical system, impose different (and in general less obvious) constraints on the dynamics of a high-dimensional dynamical system if such a system is to be employed for complex computations on continuous input streams. One might conjecture that unsupervised learning and regulation processes in neural systems adapt the system dynamics in such a way that these constraints are met. Hence, suitable variations of the idea of self-organized criticality may help us to gain a system-level perspective on synaptic plasticity and other adaptive processes in neural systems.

Notes

1. For the sake of completeness, we give here the definition of an attractor according to Strogatz (1994): He defines an attractor to be a closed set A with the following properties: (1) A is an invariant set: any trajectory x(t) that starts in A stays in A for all time. (2) A attracts an open set of initial conditions: there is an open set U containing A such that if x(0) ∈ U, then the distance from x(t) to A tends to zero as t → ∞. The largest such U is called the basin of attraction of A. (3) A is minimal: there is no proper subset of A that satisfies conditions 1 and 2.

2. In Kauffman (1969), the inactive state of a variable is denoted by 0. We use −1 here for reasons of notational consistency.

3. Here, xi potentially depends on all other variables x1, . . . , xN. The function fi can always be restricted such that xi is determined by the inputs to element i only.

4. In Kauffman (1993), a state cycle is also called an attractor. Because such state cycles can be unstable to most minimal perturbations, we will avoid the term attractor here.

5. Wolfram considered automata with a neighborhood of five cells in total and two possible cell states. Since he considered "totalistic" transfer functions only (i.e., the function depends on the sum of the neighborhood states only), the number of possible transfer functions was small. Hence, the behavior of all such automata could be studied.

6. In the case of two-state cellular automata, high λ values imply that most state transitions map to the single nonquiescent state, which leads to ordered dynamics. The most heterogeneous rules are found at λ = 0.5.

7. The model in Maass et al. (2002) was introduced in the context of biologically inspired neural microcircuits. The network consisted of spiking neurons. In Jaeger (2002), the network consisted of sigmoidal neurons.

8. The delayed 3-bit parity of an input signal u(·) is given by PARITY(u(t − τ), u(t − τ − 1), u(t − τ − 2)) for delays τ > 0. The function PARITY outputs 1 if the number of inputs which assume the value ū + 1 is odd, and −1 otherwise.

9. This assumption is best justified if such a readout neuron is located, for example, in another brain area that receives massive input from many neurons in this microcircuit and has only diffuse backward projection. But it is certainly problematic, and should be addressed in future elaborations of the present approach.

10. One can be even more realistic and filter it also by a model for the short-term dynamics of the synapse into the readout neuron, but this turns out to make no difference for the analysis proposed in this chapter.

11. Membrane voltage Vm modeled by τm dVm/dt = −(Vm − Vresting) + Rm · (Isyn(t) + Ibackground + Inoise), where τm = 30 ms is the membrane time constant, Isyn models synaptic inputs from other neurons in the circuit, Ibackground models a constant unspecific background input, and Inoise models noise in the input. The membrane resistance Rm was chosen as 1 MΩ.

12. Short-term synaptic dynamics was modeled according to Markram et al. (1998), with distributions of the synaptic parameters U (initial release probability), D (time constant for depression), and F (time constant for facilitation) chosen to reflect empirical data (see Maass et al., 2002, for details).

13. The VC-dimension (of a class H of maps H from some universe Suniv of inputs into {0, 1}) is defined as the size of the largest subset S ⊆ Suniv which can be shattered by H. One says that S ⊆ Suniv is shattered by H if for every map f : S → {0, 1} there exists a map H in H such that

H(u) = f(u) for all u ∈ S (this means that every possible binary classification of the inputs u ∈ S can be carried out by some hypothesis H in H).

14. The rank of the matrix consisting of the 500 circuit states xu(t) for t = 200 ms was computed for 500 spike patterns over 200 ms as described in section 6.9.3; see fig. 6.5a. For each circuit, the average over five different sets of spike patterns was used.

15. Similar results arise if one records the analog values of the circuit states with a limited precision of, say, 1%.

7

A Variational Principle for Graphical Models

Martin J. Wainwright and Michael I. Jordan

Graphical models bring together graph theory and probability theory in a powerful formalism for multivariate statistical modeling. In statistical signal processing— as well as in related ﬁelds such as communication theory, control theory, and bioinformatics—statistical models have long been formulated in terms of graphs, and algorithms for computing basic statistical quantities such as likelihoods and marginal probabilities have often been expressed in terms of recursions operating on these graphs. Examples include hidden Markov models, Markov random ﬁelds, the forward-backward algorithm, and Kalman ﬁltering (Kailath et al., 2000; Pearl, 1988; Rabiner and Juang, 1993). These ideas can be understood, uniﬁed, and generalized within the formalism of graphical models. Indeed, graphical models provide a natural framework for formulating variations on these classical architectures, and for exploring entirely new families of statistical models. The recursive algorithms cited above are all instances of a general recursive algorithm known as the junction tree algorithm (Lauritzen and Spiegelhalter, 1988). The junction tree algorithm takes advantage of factorization properties of the joint probability distribution that are encoded by the pattern of missing edges in a graphical model. For suitably sparse graphs, the junction tree algorithm provides a systematic and practical solution to the general problem of computing likelihoods and other statistical quantities associated with a graphical model. Unfortunately, many graphical models of practical interest are not “suitably sparse,” so that the junction tree algorithm no longer provides a viable computational solution to the problem of computing marginal probabilities and other expectations. 
One popular source of methods for attempting to cope with such cases is the Markov chain Monte Carlo (MCMC) framework, and indeed there is a signiﬁcant literature on the application of MCMC methods to graphical models (Besag and Green, 1993; Gilks et al., 1996). However, MCMC methods can be overly slow for practical applications in ﬁelds such as signal processing, and there has been signiﬁcant interest in developing faster approximation techniques. The class of variational methods provides an alternative approach to computing approximate marginal probabilities and expectations in graphical models. Roughly

speaking, a variational method is based on casting a quantity of interest (e.g., a likelihood) as the solution to an optimization problem, and then solving a perturbed version of this optimization problem. Examples of variational methods for computing approximate marginal probabilities and expectations include the “loopy” form of the belief propagation or sum-product algorithm (McEliece et al., 1998; Yedidia et al., 2001) as well as a variety of so-called mean ﬁeld algorithms (Jordan et al., 1999; Zhang, 1996). Our principal goal in this chapter is to give a mathematically precise and computationally oriented meaning to the term variational in the setting of graphical models—a meaning that reposes on basic concepts in the ﬁeld of convex analysis (Rockafellar, 1970). Compared to the somewhat loose deﬁnition of variational that is often encountered in the graphical models literature, our characterization has certain advantages, both in clarifying the relationships among existing algorithms, and in permitting fuller exploitation of the general tools of convex optimization in the design and analysis of new algorithms. Brieﬂy, the core issues can be summarized as follows. In order to deﬁne an optimization problem, it is necessary to specify both a cost function to be optimized, and a constraint set over which the optimization takes place. Reﬂecting the origins of most existing variational methods in statistical physics, developers of variational methods generally express the function to be optimized as a “free energy,” meaning a functional on probability distributions. The set to be optimized over is often left implicit, but it is generally taken to be the set of all probability distributions. A basic exercise in constrained optimization yields the Boltzmann distribution as the general form of the solution. While useful, this derivation has two shortcomings. 
First, the optimizing argument is a joint probability distribution, not a set of marginal probabilities or expectations. Thus, the derivation leaves us short of our goal of a variational representation for computing marginal probabilities. Second, the set of all probability distributions is a very large set, and formulating the optimization problem in terms of such a set provides little guidance in the design of computationally efficient approximations. Our approach addresses both of these issues. The key insight is to formulate the optimization problem not over the set of all probability distributions, but rather over a finite-dimensional set M of realizable mean parameters. This set is convex in general, and it is a polytope in the case of discrete random variables. There are several natural ways to approximate this convex set, and a broad range of extant algorithms turn out to involve particular choices of approximations. In particular, as we will show, the "loopy" form of the sum-product or belief propagation algorithm involves an outer approximation to M, whereas the more classical mean field algorithms involve an inner approximation to the set M. The characterization of belief propagation as an optimization over an outer approximation of a certain convex set does not arise readily within the standard formulation of variational methods. Indeed, given an optimization over all possible probability distributions, it is difficult to see how to move "outside" of such a set. Similarly, while the standard formulation does provide some insight into the differences between belief propagation and mean field methods (in that

they optimize different "free energies"), the standard formulation does not involve the set M, and hence does not reveal the fundamental difference in terms of outer versus inner approximations. The core of the chapter is a variational characterization of the problem solved by the junction tree algorithm—that of computing exact marginal probabilities and expectations associated with subsets of nodes in a graphical model. These probabilities are obtained as the maximizing arguments of an optimization over the set M. Perhaps surprisingly, this problem is a convex optimization problem for a broad class of graphical models. With this characterization in hand, we show how variational methods arise as "relaxations"—that is, simplified optimization problems that involve some approximation of the constraint set, the cost function, or both. We show how a variety of standard variational methods, ranging from classical mean field to cluster variational methods, fit within this framework. We also discuss new methods that emerge from this framework, including a relaxation based on semidefinite constraints and a link between reweighted forms of the max-product algorithm and linear programming. The remainder of the chapter is organized as follows. The first two sections are devoted to basics: section 7.1 provides an overview of graphical models, and section 7.2 gives a brief discussion of exponential families. In section 7.3, we develop a general variational representation for computing marginal probabilities and expectations in exponential families. Section 7.4 illustrates how various exact methods can be understood from this perspective. The rest of the chapter—sections 7.5 through 7.7—is devoted to the exploration of various relaxations of this exact variational principle, which in turn yield various algorithms for computing approximations to marginal probabilities and other expectations.

7.1 Background

7.1.1 Graphical Models

A graphical model consists of a collection of probability distributions that factorize according to the structure of an underlying graph. A graph G = (V, E) is formed by a collection of vertices V and a collection of edges E. An edge consists of a pair of vertices, and may be either directed or undirected. Associated with each vertex s ∈ V is a random variable xs taking values in some set Xs, which may be either continuous (e.g., Xs = R) or discrete (e.g., Xs = {0, 1, . . . , m − 1}). For any subset A of the vertex set V, we define xA := {xs | s ∈ A}.

Directed Graphical Models  In the directed case, each edge is directed from parent to child. We let π(s) denote the set of all parents of a given node s ∈ V. (If s has no parents, then the set π(s) should be understood to be empty.) With this notation,

a directed graphical model consists of a collection of probability distributions that factorize in the following way:

p(x) = ∏_{s∈V} p(xs | xπ(s)).   (7.1)

It can be verified that our use of notation is consistent, in that p(xs | xπ(s)) is, in fact, the conditional distribution for the global distribution p(x) thus defined.

Undirected Graphical Models  In the undirected case, the probability distribution factorizes according to functions defined on the cliques of the graph (i.e., fully connected subsets of V). In particular, associated with each clique C is a compatibility function ψC : X^n → R+ that depends only on the subvector xC. With this notation, an undirected graphical model (also known as a Markov random field) consists of a collection of distributions that factorize as

p(x) = (1/Z) ∏_C ψC(xC),   (7.2)

where the product is taken over all cliques of the graph. The quantity Z is a constant chosen to ensure that the distribution is normalized. In contrast to the directed case (eq. 7.1), in general the compatibility functions ψC need not have any obvious or direct relation to local marginal distributions. Families of probability distributions as defined in equation 7.1 or 7.2 also have a characterization in terms of conditional independencies among subsets of random variables. We will not use this characterization in this chapter, but refer the interested reader to Lauritzen (1996) for a full treatment.
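To make the two factorizations concrete, the following sketch builds the same distribution on a three-variable binary chain both ways. All numerical tables are illustrative choices, not values from the chapter:

```python
import itertools

# Directed factorization (eq. 7.1) on the chain 1 -> 2 -> 3:
# p(x) = p(x1) p(x2 | x1) p(x3 | x2), with illustrative binary tables.
p1 = {0: 0.6, 1: 0.4}
p2_given_1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # (x2, x1)
p3_given_2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # (x3, x2)

def p_directed(x1, x2, x3):
    return p1[x1] * p2_given_1[(x2, x1)] * p3_given_2[(x3, x2)]

# The same distribution as an undirected model (eq. 7.2): compatibility
# functions on the edge cliques {1,2} and {2,3}, normalized by Z.
psi12 = {(a, b): p1[a] * p2_given_1[(b, a)] for a in (0, 1) for b in (0, 1)}
psi23 = {(b, c): p3_given_2[(c, b)] for b in (0, 1) for c in (0, 1)}
Z = sum(psi12[(a, b)] * psi23[(b, c)]
        for a, b, c in itertools.product((0, 1), repeat=3))

def p_undirected(x1, x2, x3):
    return psi12[(x1, x2)] * psi23[(x2, x3)] / Z

# Both factorizations define the same normalized distribution.
total = sum(p_directed(*x) for x in itertools.product((0, 1), repeat=3))
```

Because the compatibility functions were built from normalized conditionals, Z comes out equal to 1 here; for arbitrary positive ψC this would not be the case, which is exactly why Z is needed in equation 7.2.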

7.1.2 Inference Problems and Exact Algorithms

Given a probability distribution p(·) defined by a graphical model, our focus will be solving one or more of the following inference problems:

1. computing the likelihood;
2. computing the marginal distribution p(xA) over a particular subset A ⊂ V of nodes;
3. computing the conditional distribution p(xA | xB), for disjoint subsets A and B, where A ∪ B is in general a proper subset of V;
4. computing a mode of the density (i.e., an element x̂ in the set arg max_{x∈X^n} p(x)).

Problem 1 is a special case of problem 2, because the likelihood is the marginal probability of the observed data. The computation of a conditional probability in problem 3 is similar in that it also requires marginalization steps, an initial one to obtain the numerator p(xA, xB), and a further step to obtain the denominator p(xB). In contrast, the problem of computing modes stated in problem 4 is fundamentally different, since it entails maximization rather than integration. Although problem 4 is not the main focus of this chapter, there are important connections

between the problem of computing marginals and that of computing modes; these are discussed in section 7.7.2. To understand the challenges inherent in these inference problems, consider the case of a discrete random vector x ∈ X^n, where Xs = {0, 1, . . . , m − 1} for each vertex s ∈ V. A naive approach to computing a marginal at a single node—say p(xs)—entails summing over all configurations of the form {x′ | x′s = xs}. Since this set has m^(n−1) elements, it is clear that a brute-force approach will rapidly become intractable as n grows. Similarly, computing a mode entails solving an integer programming problem over an exponential number of configurations. For continuous random vectors, the problems are no easier,¹ and typically harder, since they require computing a large number of integrals. Both directed and undirected graphical models involve factorized expressions for joint probabilities, and it should come as no surprise that exact inference algorithms treat them in an essentially identical manner. Indeed, to permit a simple unified treatment of inference algorithms, it is convenient to convert directed models to undirected models and to work exclusively within the undirected formalism. Any directed graph can be converted, via a process known as moralization (Lauritzen and Spiegelhalter, 1988), to an undirected graph that—at least for the purposes of solving inference problems—is equivalent. Throughout the rest of the chapter, we assume that this transformation has been carried out.

Message-passing on trees  For graphs without cycles—also known as trees—these inference problems can be solved exactly by recursive "message-passing" algorithms of a dynamic programming nature, with a computational complexity that scales only linearly in the number of nodes.
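The brute-force summation just described can be written directly. This sketch uses a small chain with random positive potentials (an illustrative stand-in, not a model from the chapter); it enumerates all m^n configurations, with each of the m values of x0 accumulating m^(n−1) terms:

```python
import itertools
import random

random.seed(0)
n, m = 8, 3                      # n nodes, m states per node
nodes = range(n)
edges = [(s, s + 1) for s in range(n - 1)]   # a simple chain for illustration

# Random positive compatibility functions psi_s and psi_st.
psi = {s: [random.uniform(0.5, 1.5) for _ in range(m)] for s in nodes}
psi_e = {e: [[random.uniform(0.5, 1.5) for _ in range(m)] for _ in range(m)]
         for e in edges}

def unnormalized(x):
    """Product of all node and edge compatibility functions (eq. 7.2, up to Z)."""
    w = 1.0
    for s in nodes:
        w *= psi[s][x[s]]
    for (s, t) in edges:
        w *= psi_e[(s, t)][x[s]][x[t]]
    return w

# Marginal at node 0 by exhaustive enumeration: m**(n-1) terms per value of x0.
Z = 0.0
mu0 = [0.0] * m
for x in itertools.product(range(m), repeat=n):
    w = unnormalized(x)
    Z += w
    mu0[x[0]] += w
mu0 = [v / Z for v in mu0]
```

With n = 8 and m = 3 this is only 6561 terms, but the count grows as m^n, which is exactly the intractability the message-passing algorithms below avoid.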
In particular, for the case of computing marginals, the dynamic programming solution takes the form of a general algorithm known as the sum-product algorithm, whereas for the problem of computing modes it takes the form of an analogous algorithm known as the max-product algorithm. Here we provide a brief description of these algorithms; further details can be found in various sources (Aji and McEliece, 2000; Kschischang and Frey, 1998; Lauritzen and Spiegelhalter, 1988; Loeliger, 2004). We begin by observing that the cliques of a tree-structured graph T = (V, E(T)) are simply the individual nodes and edges. As a consequence, any tree-structured graphical model has the following factorization:

p(x) = (1/Z) ∏_{s∈V} ψs(xs) ∏_{(s,t)∈E(T)} ψst(xs, xt).   (7.3)

Here we describe how the sum-product algorithm computes the marginal distribution μs(xs) := Σ_{{x′ | x′s = xs}} p(x′) for every node of a tree-structured graph. We will focus in detail on the case of discrete random variables, with the understanding that the computations carry over (at least in principle) to the continuous case by replacing sums with integrals.

Sum-product algorithm  The essential principle underlying the sum-product algorithm on trees is divide and conquer: we solve a large problem by breaking it

down into a sequence of simpler problems. The tree itself provides a natural way to break down the problem as follows. For an arbitrary s ∈ V, consider the set of its neighbors N(s) = {u ∈ V | (s, u) ∈ E}. For each u ∈ N(s), let Tu = (Vu, Eu) be the subgraph formed by the set of nodes (and edges joining them) that can be reached from u by paths that do not pass through node s. The key property of a tree is that each such subgraph Tu is again a tree, and Tu and Tv are disjoint for u ≠ v. In this way, each vertex u ∈ N(s) can be viewed as the root of a subtree Tu, as illustrated in fig. 7.1a. For each subtree Tt, we define xVt := {xu | u ∈ Vt}. Now consider the collection of terms in equation 7.3 associated with vertices or edges in Tt: collecting all of these terms yields a subproblem p(xVt ; Tt) for this subtree. Now the conditional independence properties of a tree allow the computation of the marginal at node s to be broken down into a product of the form

μs(xs) ∝ ψs(xs) ∏_{t∈N(s)} M*ts(xs).   (7.4)

Each term M*ts(xs) in this product is the result of performing a partial summation for the subproblem p(xVt ; Tt) in the following way:

M*ts(xs) = Σ_{{x′Tt | x′s = xs}} ψst(xs, x′t) p(x′Tt ; Tt).   (7.5)

For fixed xs, the subproblem defining M*ts(xs) is again a tree-structured summation, albeit involving a subtree Tt smaller than the original tree T. Therefore, it too can be broken down recursively in a similar fashion. In this way, the marginal at node s can be computed by a series of recursive updates.

Rather than applying the procedure described above to each node separately, the sum-product algorithm computes the marginals for all nodes simultaneously and in parallel. At each iteration, each node t passes a "message" to each of its neighbors u ∈ N(t). This message, which we denote by Mtu(xu), is a function of the possible states xu ∈ Xu (i.e., a vector of length |Xu| for discrete random variables). On the full graph, there are a total of 2|E| messages, one for each direction of each edge. This full collection of messages is updated, typically in parallel, according to the following recursion:

Mts(xs) ← κ Σ_{xt} [ ψst(xs, xt) ψt(xt) ∏_{u∈N(t)\s} Mut(xt) ],   (7.6)

where κ > 0 is a normalization constant. It can be shown (Pearl, 1988) that for tree-structured graphs, iterates generated by the update 7.6 will converge to a unique fixed point M* = {M*st, M*ts | (s, t) ∈ E} after a finite number of iterations. Moreover, component M*ts of this fixed point is precisely equal, up to a normalization constant, to the subproblem defined in equation 7.5, which justifies our abuse of notation post hoc. Since the fixed point M* specifies the solution to all of the subproblems, the marginal μs at every node s ∈ V can be computed easily via equation 7.4.
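The recursion in equation 7.6 and the marginal computation in equation 7.4 can be transcribed almost literally. The sketch below runs parallel message updates on a small illustrative tree with random potentials and checks the node-0 marginal against brute-force enumeration:

```python
import itertools
import random

random.seed(1)
m = 3                                   # states per node
edges = [(0, 1), (0, 2), (2, 3)]        # a small illustrative tree
nodes = sorted({s for e in edges for s in e})
nbrs = {s: [] for s in nodes}
for a, b in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)

psi = {s: [random.uniform(0.5, 1.5) for _ in range(m)] for s in nodes}
psi_e = {e: [[random.uniform(0.5, 1.5) for _ in range(m)] for _ in range(m)]
         for e in edges}               # psi_e[(s,t)][xs][xt] = psi_st(xs, xt)

def edge_pot(s, t, xs, xt):
    return psi_e[(s, t)][xs][xt] if (s, t) in psi_e else psi_e[(t, s)][xt][xs]

# Parallel sum-product updates (eq. 7.6); M[t][s] is the message from t to s.
M = {t: {s: [1.0 / m] * m for s in nbrs[t]} for t in nodes}
for _ in range(len(nodes)):             # enough iterations to converge on a tree
    new = {t: {} for t in nodes}
    for t in nodes:
        for s in nbrs[t]:
            msg = []
            for xs in range(m):
                total = 0.0
                for xt in range(m):
                    prod = edge_pot(s, t, xs, xt) * psi[t][xt]
                    for u in nbrs[t]:
                        if u != s:
                            prod *= M[u][t][xt]
                    total += prod
                msg.append(total)
            kappa = 1.0 / sum(msg)      # normalization constant
            new[t][s] = [kappa * v for v in msg]
    M = new

def marginal(s):
    """Marginal at node s via eq. 7.4: psi_s times incoming messages."""
    mu = [psi[s][xs] for xs in range(m)]
    for t in nbrs[s]:
        mu = [mu[xs] * M[t][s][xs] for xs in range(m)]
    z = sum(mu)
    return [v / z for v in mu]

# Brute-force check of the marginal at node 0.
Z = 0.0
mu0_bf = [0.0] * m
for x in itertools.product(range(m), repeat=len(nodes)):
    w = 1.0
    for s in nodes:
        w *= psi[s][x[s]]
    for (s, t) in edges:
        w *= psi_e[(s, t)][x[s]][x[t]]
    Z += w
    mu0_bf[x[0]] += w
mu0_bf = [v / Z for v in mu0_bf]
```

On a tree, the parallel updates reach their fixed point after a number of iterations equal to the tree's diameter, after which the normalized marginals coincide exactly with the enumeration.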

Max-product algorithm  Suppose that the summation in the update 7.6 is replaced by a maximization. The resulting max-product algorithm solves the problem of finding a mode of a tree-structured distribution p(x). In this sense, it represents a generalization of the Viterbi algorithm (Forney, 1973) from chains to arbitrary tree-structured graphs. More specifically, the max-product updates will converge to another unique fixed point M*—distinct, of course, from the sum-product fixed point. This fixed point can be used to compute the max-marginal νs(xs) := max_{{x′ | x′s = xs}} p(x′) at each node of the graph, in a manner analogous to the computation of ordinary sum-marginals. Given these max-marginals, it is straightforward to compute a mode x̂ ∈ arg max_x p(x) of the distribution (Dawid, 1992; Wainwright et al., 2004). More generally, updates of this form apply to arbitrary commutative semirings on tree-structured graphs (Aji and McEliece, 2000; Dawid, 1992). The pairs "sum-product" and "max-product" are two particular examples of such an algebraic structure.

Junction Tree Representation  We have seen that inference problems on trees can be solved exactly by recursive message-passing algorithms. Given a graph with cycles, a natural idea is to cluster its nodes so as to form a clique tree—that is, an acyclic graph whose nodes are formed by cliques of G. Having done so, it is tempting to simply apply a standard algorithm for inference on trees. However, the clique tree must satisfy an additional restriction so as to ensure consistency of these computations. In particular, since a given vertex s ∈ V may appear in multiple cliques (say C1 and C2), what is required is a mechanism for enforcing consistency among the different appearances of the random variable xs. In order to enforce consistency, it turns out to be necessary to restrict attention to those clique trees that satisfy a particular graph-theoretic property.
In particular, we say that a clique tree satisfies the running intersection property if for any two clique nodes C1 and C2, all nodes on the unique path joining them contain the intersection C1 ∩ C2. Any clique tree with this property is known as a junction tree. For what type of graphs can one build junction trees? An important result in graph theory asserts that a graph G has a junction tree if and only if it is triangulated.2 This result underlies the junction tree algorithm (Lauritzen and Spiegelhalter, 1988) for exact inference on arbitrary graphs, which consists of the following three steps:

Step 1: Given a graph with cycles G, triangulate it by adding edges as necessary.
Step 2: Form a junction tree associated with the triangulated graph.
Step 3: Run a tree inference algorithm on the junction tree.

We illustrate these basic steps with an example.

Example 7.1 Consider the 3 × 3 grid shown in the top panel of fig. 7.1b. The first step is to form a triangulated version, as shown in the bottom panel of fig. 7.1b. Note that the graph would not be triangulated if the additional edge joining nodes 2 and 8

162

A Variational Principle for Graphical Models

Figure 7.1 (a) Decomposition of a tree, rooted at node s, into subtrees. Each neighbor (e.g., u) of node s is the root of a subtree (e.g., T_u). Subtrees T_u and T_v, for u ≠ v, are disconnected when node s is removed from the graph. (b), (c) Illustration of junction tree construction. Top panel in (b) shows the original graph: a 3 × 3 grid. Bottom panel in (b) shows the triangulated version of the original graph. Note the two 4-cliques in the middle. (c) Corresponding junction tree for the triangulated graph in (b), with maximal cliques depicted within ellipses. The rectangles are separator sets; these are intersections of neighboring cliques.

were not present. Without this edge, the 4-cycle (2 − 4 − 8 − 6 − 2) would lack a chord. Figure 7.1c shows a junction tree associated with this triangulated graph, in which circles represent maximal cliques (i.e., fully connected subsets of nodes that cannot be augmented with an additional node and remain fully connected), and boxes represent separator sets (intersections of cliques adjacent in the junction tree). ♦

An important by-product of the junction tree construction is an alternative representation of the probability distribution defined by a graphical model. Let 𝒞 denote the set of all maximal cliques in the triangulated graph, and define 𝒮 as the set of all separator sets in the junction tree. For each separator set S ∈ 𝒮, let d(S) denote the number of maximal cliques to which it is adjacent. The junction tree framework guarantees that the distribution p(·) factorizes in the form

p(x) = ∏_{C∈𝒞} μ_C(x_C) / ∏_{S∈𝒮} [μ_S(x_S)]^{d(S)−1},   (7.7)

where μ_C and μ_S are the marginal distributions over the cliques and separator sets respectively. Observe that unlike the representation of equation 7.2, the decomposition of equation 7.7 is directly in terms of marginal distributions, and does not require a normalization constant (i.e., Z = 1).


Example 7.2 Markov Chain Consider the Markov chain p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2). The cliques in a graphical model representation are {1, 2} and {2, 3}, with separator {2}. Clearly the distribution cannot be written as the product of marginals involving only the cliques. However, if we include the separator, it can be factorized in terms of its marginals—viz., p(x1, x2, x3) = p(x1, x2) p(x2, x3) / p(x2). ♦

To anticipate the development in the sequel, it is helpful to consider the following "inverse" perspective on the junction tree representation. Suppose that we are given a set of functions τ_C(x_C) and τ_S(x_S) associated with the cliques and separator sets in the junction tree. What conditions are necessary to ensure that these functions are valid marginals for some distribution? Suppose that the functions {τ_S, τ_C} are locally consistent in the following sense:

∑_{x_S} τ_S(x_S) = 1   (normalization)   (7.8a)

∑_{x'_C | x'_S = x_S} τ_C(x'_C) = τ_S(x_S)   (marginalization)   (7.8b)
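Both the junction tree factorization 7.7 and these local consistency conditions can be checked numerically for the Markov chain of example 7.2. The sketch below uses made-up transition probabilities; the separator {2} has d(S) = 2, so equation 7.7 reduces to p(x) = μ₁₂(x1, x2) μ₂₃(x2, x3) / μ₂(x2).

```python
from itertools import product

# Three-node Markov chain with arbitrary illustrative probabilities.
p1 = [0.3, 0.7]
p2_given_1 = [[0.9, 0.1], [0.4, 0.6]]
p3_given_2 = [[0.2, 0.8], [0.5, 0.5]]

def p(x1, x2, x3):
    return p1[x1] * p2_given_1[x1][x2] * p3_given_2[x2][x3]

# Clique marginals mu_12, mu_23 and separator marginal mu_2.
mu12 = {(a, b): sum(p(a, b, c) for c in (0, 1)) for a in (0, 1) for b in (0, 1)}
mu23 = {(b, c): sum(p(a, b, c) for a in (0, 1)) for b in (0, 1) for c in (0, 1)}
mu2 = {b: sum(mu12[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Junction tree factorization (7.7) holds exactly ...
factorization_ok = all(
    abs(p(a, b, c) - mu12[(a, b)] * mu23[(b, c)] / mu2[b]) < 1e-12
    for a, b, c in product((0, 1), repeat=3))

# ... as do the local consistency conditions (7.8a) and (7.8b).
consistency_ok = (abs(sum(mu2.values()) - 1.0) < 1e-12
                  and all(abs(sum(mu23[(b, c)] for c in (0, 1)) - mu2[b]) < 1e-12
                          for b in (0, 1)))
```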

The essence of the junction tree theory described above is that such local consistency is both necessary and sufficient to ensure that these functions are valid marginals for some distribution. Finally, turning to the computational complexity of the junction tree algorithm, the computational cost grows exponentially in the size of the maximal clique in the junction tree. The size of the largest clique, minimized over all possible triangulations of a graph, defines an important graph-theoretic quantity known as the treewidth of the graph. Thus, the complexity of the junction tree algorithm is exponential in the treewidth. For certain classes of graphs, including chains and trees, the treewidth is small and the junction tree algorithm provides an effective solution to inference problems. Such families include many well-known graphical model architectures, and the junction tree algorithm subsumes many classical recursive algorithms, including the forward-backward algorithms for hidden Markov models (Rabiner and Juang, 1993), the Kalman filtering-smoothing algorithms for state-space models (Kailath et al., 2000), and the pruning and peeling algorithms from computational genetics (Felsenstein, 1981). On the other hand, there are many graphical models (e.g., grids) for which the treewidth is infeasibly large. Coping with such models requires leaving behind the junction tree framework, and turning to approximate inference algorithms.

7.1.3 Message-Passing Algorithms for Approximate Inference

In the remainder of the chapter, we present a general variational principle for graphical models that can be used to derive a class of techniques known as variational inference algorithms. To motivate our later development, we pause to give a high-level description of two variational inference algorithms, with the goal


of highlighting their simple and intuitive nature. The ﬁrst variational algorithm that we consider is a so-called “loopy” form of the sum-product algorithm (also referred to as the belief propagation algorithm). Recall that the sum-product algorithm is designed as an exact method for trees; from a purely algorithmic point of view, however, there is nothing to prevent one from running the procedure on a graph with cycles. More speciﬁcally, the message updates 7.6 can be applied at a given node while ignoring the presence of cycles— essentially pretending that any given node is embedded in a tree. Intuitively, such an algorithm might be expected to work well if the graph is suitably “tree like,” such that the eﬀect of messages propagating around cycles is appropriately diminished. This algorithm is in fact widely used in various applications that involve signal processing, including image processing, computer vision, computational biology, and error-control coding. A second variational algorithm is the so-called naive mean ﬁeld algorithm. For concreteness, we describe it in application to a very special type of graphical model, known as the Ising model. The Ising model is a Markov random ﬁeld involving a binary random vector x ∈ {0, 1}n , in which pairs of adjacent nodes are coupled with a weight θst , and each node has an observation weight θs . (See examples 7.4 and 7.11 for a more detailed description of this model.) To motivate the mean ﬁeld updates, we consider the Gibbs sampler for this model, in which the basic update step is to choose a node s ∈ V randomly, and then to update the state of the associated random variable according to the conditional probability with neighboring states ﬁxed. 
More precisely, denoting by N(s) the neighbors of a node s ∈ V, and letting x^{(p)}_{N(s)} denote the state of the neighbors of s at iteration p, the Gibbs update for x_s takes the following form:

x_s^{(p+1)} = 1 if u ≤ {1 + exp[−(θ_s + ∑_{t∈N(s)} θ_st x_t^{(p)})]}^{−1}, and 0 otherwise,   (7.9)

where u is a sample from a uniform distribution U(0, 1). It is well known that this procedure generates a sequence of configurations that converges (in a stochastic sense) to a sample from the Ising model distribution. In a dense graph, such that the cardinality of N(s) is large, we might attempt to invoke a law of large numbers or some other concentration result for ∑_{t∈N(s)} θ_st x_t^{(p)}. To the extent that such sums are concentrated, it might make sense to replace sample values with expectations, which motivates the following averaged version of equation 7.9:

μ_s ← {1 + exp[−(θ_s + ∑_{t∈N(s)} θ_st μ_t)]}^{−1},   (7.10)

in which μs denotes an estimate of the marginal probability p(xs = 1). Thus, rather than ﬂipping the random variable xs with a probability that depends on the state of its neighbors, we update a parameter μs using a deterministic function of the corresponding parameters {μt | t ∈ N (s)} at its neighbors. Equation 7.10


deﬁnes the naive mean ﬁeld algorithm for the Ising model, which can be viewed as a message-passing algorithm on the graph. At ﬁrst sight, message-passing algorithms of this nature might seem rather mysterious, and do raise some questions. Do the updates have ﬁxed points? Do the updates converge? What is the relation between the ﬁxed points and the exact quantities? The goal of the remainder of this chapter is to shed some light on such issues. Ultimately, we will see that a broad class of message-passing algorithms, including the mean ﬁeld updates, the sum-product and max-product algorithms, as well as various extensions of these methods can all be understood as solving either exact or approximate versions of a certain variational principle for graphical models.
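A minimal sketch of the naive mean field iteration 7.10 follows, for a small Ising model; the 4-cycle graph and all parameter values are invented for illustration, and the exact marginals are computed by brute force (feasible only because the graph is tiny) purely to gauge the quality of the approximation.

```python
import math
from itertools import product

# Toy Ising model on a 4-cycle with weak, made-up couplings.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_s = {0: 0.2, 1: -0.1, 2: 0.3, 3: 0.0}
theta_st = {e: 0.15 for e in E}        # weak coupling regime

nbrs = {s: [] for s in V}
for (a, b) in E:
    nbrs[a].append(b)
    nbrs[b].append(a)

def coupling(s, t):
    return theta_st[(s, t)] if (s, t) in theta_st else theta_st[(t, s)]

# Naive mean field updates: mu_s <- sigmoid(theta_s + sum_t theta_st mu_t).
mu = {s: 0.5 for s in V}
for _ in range(200):
    for s in V:
        z = theta_s[s] + sum(coupling(s, t) * mu[t] for t in nbrs[s])
        mu[s] = 1.0 / (1.0 + math.exp(-z))

# Exact marginals p(x_s = 1) by enumerating all 2^4 configurations.
def weight(x):
    return math.exp(sum(theta_s[s] * x[s] for s in V)
                    + sum(theta_st[e] * x[e[0]] * x[e[1]] for e in E))

Z = sum(weight(x) for x in product((0, 1), repeat=4))
exact = {s: sum(weight(x) for x in product((0, 1), repeat=4)
                if x[s] == 1) / Z
         for s in V}
```

In this weak-coupling regime the deterministic fixed point tracks the exact marginals closely; with strong couplings the same iteration can converge to a poor or symmetry-broken fixed point, which is one of the questions the variational perspective below addresses.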

7.2 Graphical Models in Exponential Form

We begin by describing how many graphical models can be viewed as particular types of exponential families. Further background can be found in the books by Efron (1978) and Brown (1986). This exponential family representation is the foundation of our later development of the variational principle.

7.2.1 Maximum Entropy

One way in which to motivate exponential family representations of graphical models is through the principle of maximum entropy. The setup for this principle is as follows: given a collection of functions φ_α : X^n → R, suppose that we have observed their expected values—that is, we have

E[φ_α(x)] = μ_α   for all α ∈ I,   (7.11)

where μ = {μ_α | α ∈ I} is a real vector, I is an index set, and d := |I| is the length of the vectors μ and φ := {φ_α | α ∈ I}. Our goal is to use the observations to infer a full probability distribution. Let P denote the set of all probability distributions p over the random vector x. Since there are (in general) many distributions p ∈ P that are consistent with the observations 7.11, we need a principled method for choosing among them. The principle of maximum entropy is to choose the distribution p_ME such that its entropy, defined as H(p) := −∑_{x∈X^n} p(x) log p(x), is maximized. More formally, the maximum entropy solution p_ME is given by the following constrained optimization problem:

p_ME := arg max_{p∈P} H(p)   subject to constraints 7.11.   (7.12)

One interpretation of this principle is as choosing the distribution with maximal uncertainty while remaining faithful to the data. Presuming that problem 7.12 is feasible, it is straightforward to show using a Lagrangian formulation that its optimal solution takes the form

p(x; θ) ∝ exp{ ∑_{α∈I} θ_α φ_α(x) },   (7.13)

which corresponds to a distribution in exponential form. Note that the exponential decomposition 7.13 is analogous to the product decomposition 7.2 considered earlier. In the language of exponential families, the vector θ ∈ R^d is known as the canonical parameter, and the collection of functions φ = {φ_α | α ∈ I} are known as sufficient statistics. In the context of our current presentation, each canonical parameter θ_α has a very concrete interpretation as the Lagrange multiplier associated with the constraint E[φ_α(x)] = μ_α.

7.2.2 Exponential Families

We now define exponential families in more generality. Any exponential family consists of a particular class of densities taken with respect to a fixed base measure ν. The base measure is typically counting measure (as in our discrete example above), or Lebesgue measure (e.g., for Gaussian families). Throughout this chapter, we use ⟨a, b⟩ to denote the ordinary Euclidean inner product between two vectors a and b of the same dimension. Thus, for each fixed x ∈ X^n, the quantity ⟨θ, φ(x)⟩ is the Euclidean inner product in R^d of the two vectors θ ∈ R^d and φ(x) = {φ_α(x) | α ∈ I}. With this notation, the exponential family associated with φ consists of the following parameterized collection of density functions:

p(x; θ) = exp{ ⟨θ, φ(x)⟩ − A(θ) }.   (7.14)

The quantity A, known as the log partition function or cumulant generating function, is defined by the integral

A(θ) = log ∫_{X^n} exp⟨θ, φ(x)⟩ ν(dx).   (7.15)

Presuming that the integral is finite, this definition ensures that p(x; θ) is properly normalized (i.e., ∫_{X^n} p(x; θ) ν(dx) = 1). With the set of potentials φ fixed, each parameter vector θ indexes a particular member p(x; θ) of the family. The canonical parameters θ of interest belong to the set

Θ := {θ ∈ R^d | A(θ) < ∞}.   (7.16)

Throughout this chapter, we deal exclusively with regular exponential families, for which the set Θ is assumed to be open. We summarize for future reference some well-known properties of A:


Lemma 7.1 The cumulant generating function A is convex in terms of θ. Moreover, it is infinitely differentiable on Θ, and its derivatives correspond to cumulants. As an important special case, the first derivatives of A take the form

∂A/∂θ_α = ∫_{X^n} φ_α(x) p(x; θ) ν(dx) = E_θ[φ_α(x)],   (7.17)

and define a vector μ := E_θ[φ(x)] of mean parameters associated with the exponential family. There are important relations between the canonical and mean parameters, and many inference problems can be formulated in terms of the mean parameters. These correspondences and other properties of the cumulant generating function are fundamental to our development of a variational principle for solving inference problems.

7.2.3 Illustrative Examples

In order to illustrate these deﬁnitions, we now discuss some particular classes of graphical models that commonly arise in signal and image processing problems, and how they can be represented in exponential form. In particular, we will see that graphical structure is reﬂected in the choice of suﬃcient statistics, or equivalently in terms of constraints on the canonical parameter vector. We begin with an important case—the Gaussian Markov random ﬁeld (MRF)— which is widely used for modeling various types of imagery and spatial data (Luettgen et al., 1994; Szeliski, 1990). Example 7.3 Gaussian Markov Random Field Consider a graph G = (V, E), such as that illustrated in ﬁg. 7.2(a), and suppose that each vertex s ∈ V has an associated Gaussian random variable xs . Any such scalar Gaussian is a (2-dimensional) exponential family speciﬁed 8 by suﬃcient 9 statistics xs and x2s . Turning to the Gaussian random vector x := xs | s ∈ V , it has an exponential family representation in terms of the suﬃcient statistics {xs , x2s | s ∈ V9} ∪ 8 8 {xs xt | (s, t)9 ∈ E}, with associated canonical parameters θs , θss | s ∈ V ∪ θst | (s, t) ∈ E . Here the additional cross-terms xs xt allow for possible correlation between components xs and xt of the Gaussian random vector. Note that there are a total of d = 2n + |E| suﬃcient statistics. The suﬃcient statistics and parameters can be represented compactly as (n + 1) × (n + 1) symmetric matrices: ⎤ ⎡ 0 θ1 θ2 . . . θn ⎥ ⎢ ⎢ θ1 θ11 θ12 . . . θ1n ⎥ ⎥ ⎢ 1 ⎥ ⎢ (7.18) X = U (θ) := ⎢ θ2 θ21 θ22 . . . θ2n ⎥ 1 x ⎢ x .. .. .. .. ⎥ ⎢ .. ⎥ . . . . ⎦ ⎣. θn θn1 θn2 . . . θnn


Figure 7.2 (a) A simple Gaussian model based on a graph G with 5 vertices. (b) The adjacency matrix of the graph G in (a), which specifies the sparsity pattern of the matrix Z(θ).

We use Z(θ) to denote the lower n × n block of U(θ); it is known as the precision matrix. We say that x forms a Gaussian Markov random field if its probability density function decomposes according to the graph G = (V, E). In terms of our canonical parameterization, this condition translates to the requirement that θ_st = 0 whenever (s, t) ∉ E. Alternatively stated, the precision matrix Z(θ) must have the same zero-pattern as the adjacency matrix of the graph, as illustrated in fig. 7.2b. For any two symmetric matrices C and D, it is convenient to define the inner product ⟨C, D⟩ := trace(C D). Using this notation leads to a particularly compact representation of a Gaussian MRF:

p(x; θ) = exp{ ⟨U(θ), X⟩ − A(θ) },   (7.19)

where A(θ) := log ∫_{R^n} exp⟨U(θ), X⟩ dx is the log cumulant generating function. The integral defining A(θ) is finite only if the n × n precision matrix Z(θ) is negative definite, so that the domain of A has the form Θ = {θ ∈ R^d | Z(θ) ≺ 0}. Note that the mean parameters in the Gaussian model have a clear interpretation. The singleton elements μ_s = E_θ[x_s] are simply the Gaussian means, whereas the elements μ_ss = E_θ[x_s²] and μ_st = E_θ[x_s x_t] are second-order moments. ♦

Markov random fields involving discrete random variables also arise in many applications, including image processing, bioinformatics, and error-control coding (Durbin et al., 1998; Geman and Geman, 1984; Kschischang et al., 2001; Loeliger, 2004). As with the Gaussian case, this class of Markov random fields also has a natural exponential representation.

Example 7.4 Multinomial Markov Random Field Suppose that each x_s is a multinomial random variable, taking values in the space X_s = {0, 1, . . . , m_s − 1}. In order to represent a Markov random field over the vector x = {x_s | s ∈ V} in exponential form, we now introduce a particular set of sufficient statistics that will be useful in what follows. For each j ∈ X_s, let I_j(x_s) be an


indicator function for the event {x_s = j}. Similarly, for each pair (j, k) ∈ X_s × X_t, let I_{jk}(x_s, x_t) be an indicator for the event {(x_s, x_t) = (j, k)}. These building blocks yield the following set of sufficient statistics:

{ I_j(x_s) | s ∈ V, j ∈ X_s } ∪ { I_j(x_s) I_k(x_t) | (s, t) ∈ E, (j, k) ∈ X_s × X_t }.   (7.20)

The corresponding canonical parameter θ has elements of the form

θ = { θ_{s;j} | s ∈ V, j ∈ X_s } ∪ { θ_{st;jk} | (s, t) ∈ E, (j, k) ∈ X_s × X_t }.   (7.21)

It is convenient to combine the canonical parameters and indicator functions using the shorthand notation θ_s(x_s) := ∑_{j∈X_s} θ_{s;j} I_j(x_s); the quantity θ_st(x_s, x_t) can be defined similarly. With this notation, a multinomial MRF with pairwise interactions can be written in exponential form as

p(x; θ) = exp{ ∑_{s∈V} θ_s(x_s) + ∑_{(s,t)∈E} θ_st(x_s, x_t) − A(θ) },   (7.22)

where the cumulant generating function is given by the summation

A(θ) := log ∑_{x∈X^n} exp{ ∑_{s∈V} θ_s(x_s) + ∑_{(s,t)∈E} θ_st(x_s, x_t) }.
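For a toy multinomial MRF these definitions can be evaluated by direct enumeration; the sketch below (a made-up three-node chain with m = 3 states per node, and helper names of my choosing) computes A(θ) and the singleton mean parameters μ_{s;j}, which is tractable only because the model is tiny.

```python
import math
from itertools import product

# Toy three-node chain MRF with m = 3 states per node and invented parameters,
# written via the shorthand theta_s(x_s) and theta_st(x_s, x_t).
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
m = 3
theta_s = {s: [0.1 * (s + 1) * j for j in range(m)] for s in V}
theta_st = {e: [[0.05 * (j - k) ** 2 for k in range(m)] for j in range(m)]
            for e in E}

def score(x):
    return (sum(theta_s[s][x[s]] for s in V)
            + sum(theta_st[(s, t)][x[s]][x[t]] for (s, t) in E))

# Cumulant generating function A(theta) by summing over all m^3 configurations.
A = math.log(sum(math.exp(score(x)) for x in product(range(m), repeat=len(V))))

def mu_single(s, j):
    # Mean parameter mu_{s;j} = E_theta[I_j(x_s)] = p(x_s = j; theta).
    return sum(math.exp(score(x) - A)
               for x in product(range(m), repeat=len(V)) if x[s] == j)
```

The exponential cost of this enumeration in n is exactly what the junction tree algorithm and the approximate methods of this chapter are designed to avoid.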

In signal-processing applications of these models, the random vector x is often viewed as hidden or partially observed (for instance, corresponding to the correct segmentation of an image). Thus, it is frequently the case that the functions θ_s are determined by noisy observations, whereas the terms θ_st control the coupling between variables x_s and x_t that are adjacent on the graph (e.g., reflecting spatial continuity assumptions). See fig. 7.3(a) for an illustration of such a multinomial MRF defined on a two-dimensional lattice, which is a widely used model in statistical image processing (Geman and Geman, 1984). In the special case that X_s = {0, 1} for all s ∈ V, the family represented by equation 7.22 is known as the Ising model. Note that the mean parameters associated with this model correspond to particular marginal probabilities. For instance, the mean parameters associated with vertex s have the form μ_{s;j} = E_θ[I_j(x_s)] = p(x_s = j; θ), and the mean parameters μ_st associated with edge (s, t) have an analogous interpretation as pairwise marginal values. ♦

Example 7.5 Hidden Markov Model A very important special case of the multinomial MRF is the hidden Markov model (HMM), which is a chain-structured graphical model widely used for the modeling of time series and other one-dimensional signals. It is conventional in the HMM literature to refer to the multinomial random variables x = {x_s | s ∈ V} as "state variables." As illustrated in fig. 7.3b, the edge set E defines a chain


Figure 7.3 (a) A multinomial MRF on a 2D lattice model. (b) A hidden Markov model (HMM) is a special case of a multinomial MRF for a chain-structured graph. (c) The graphical representation of a scalar Gaussian mixture model: the multinomial x_s indexes components in the mixture, and y_s is conditionally Gaussian (with exponential parameters γ_s) given the mixture component x_s.

linking the state variables. The parameters θst (xs , xt ) deﬁne the state transition matrix; if this transition matrix is the same for all pairs s and t, then we have a homogeneous Markov chain. Associated with each multinomial state variable xs is a noisy observation ys , deﬁned by the conditional probability distribution p(ys |xs ). If we condition on the observed value of ys , this conditional probability is simply a function of xs , which we denote by θs (xs ). Given these deﬁnitions, equation 7.22 describes the conditional probability distribution p(x | y) for the HMM. In ﬁg. 7.3b, this conditioning is captured by shading the corresponding nodes in the graph. Note that the cumulant generating function A(θ) is, in fact, equal to the log likelihood of the observed data. ♦ Graphical models are not limited to cases in which the random variables at each node belong to the same exponential family. More generally, we can consider heterogeneous combinations of exponential family members. A very natural example, which combines the two previous types of graphical model, is that of a Gaussian mixture model. Such mixture models are widely used in modeling various classes of data, including natural images, speech signals, and ﬁnancial time series data; see the book by Titterington et al. (1986) for further background. Example 7.6 Mixture Model As shown in ﬁg. 7.3c, a scalar mixture model has a very simple graphical interpretation. In particular, let xs be a multinomial variable, taking values in Xs = {0, 1, 2, . . . , ms − 1}, speciﬁed in exponential parameter form with a function θs (xs ). The role of xs is to specify the choice of mixture component in the mixture model, so that our mixture model has ms components in total. We now let ys be conditionally Gaussian given xs , so that the conditional distribution p(ys | xs ; γs ) can be written in exponential family form with canonical parameters γs that are a function of xs . 
Overall, the pair (xs , ys ) form a very simple graphical model in exponential form, as shown in ﬁg. 7.3c.


The pair (x_s, y_s) serves as a basic building block for more sophisticated graphical models. For example, one model is based on assuming that the mixture vector x is a multinomial MRF defined on an underlying graph G = (V, E), whereas the components of y are conditionally independent given the mixture vector x. These assumptions lead to an exponential family p(y, x; θ, γ) of the form

∏_{s∈V} p(y_s | x_s; γ_s) exp{ ∑_{s∈V} θ_s(x_s) + ∑_{(s,t)∈E} θ_st(x_s, x_t) }.   (7.23)

For tree-structured graphs, Crouse et al. (1998) have applied this type of mixture model to applications in wavelet-based signal processing. ♦ This type of mixture model is a particular example of a broad class of graphical models that involve heterogeneous combinations of exponential family members (e.g., hierarchical Bayesian models).
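As a sketch of how the construction in equation 7.23 is used, the following toy two-node example (all numbers and helper names are invented for illustration) computes the posterior over the binary mixture labels by enumeration, combining the Ising-style coupling on x with conditionally Gaussian observations y.

```python
import math
from itertools import product

# Two binary mixture labels (x0, x1) with a coupling theta01, and observations
# y_s | x_s ~ N(means[x_s], sigma^2). All values are made up for illustration.
theta = {0: 0.0, 1: 0.0}
theta01 = 0.8                  # labels prefer to agree
means = [-1.0, 2.0]            # Gaussian mean of each mixture component
sigma = 1.0

def log_gauss(y, mean):
    return (-0.5 * ((y - mean) / sigma) ** 2
            - 0.5 * math.log(2 * math.pi * sigma ** 2))

def log_joint(x, y):
    # log of eq. 7.23 (up to the constant normalization).
    return (sum(log_gauss(y[s], means[x[s]]) for s in (0, 1))
            + theta[0] * x[0] + theta[1] * x[1] + theta01 * x[0] * x[1])

def posterior(y):
    # p(x | y) by enumerating the four label configurations.
    w = {x: math.exp(log_joint(x, y)) for x in product((0, 1), repeat=2)}
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}
```

With observations near the second component's mean (e.g., y = (1.9, 2.1)), the posterior concentrates on the label configuration (1, 1), as expected.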

7.3 An Exact Variational Principle for Inference

With this setup, we can now rephrase inference problems in the language of exponential families. In particular, this chapter focuses primarily on the following two problems: (i) computing the cumulant generating function A(θ); and (ii) computing the vector of mean parameters μ := E_θ[φ(x)]. In Section 7.7.2 we discuss a closely related problem—namely, that of computing a mode of the distribution p(x; θ). The problem of computing the cumulant generating function arises in a variety of signal-processing problems, including likelihood ratio tests (for classification and detection problems) and parameter estimation. The computation of mean parameters is also fundamental, and takes different forms depending on the underlying graphical model. For instance, it corresponds to computing means and covariances in the Gaussian case, whereas for a multinomial MRF it corresponds to computing marginal distributions. The goal of this section is to show how both of these inference problems can be represented variationally—as the solution of an optimization problem. The variational principle that we develop, though related to the classical "free energy" approach of statistical physics (Yedidia et al., 2001), also has important differences. The classical principle yields a variational formulation for the cumulant generating function (or log partition function) in terms of optimizing over the space of all distributions. In our approach, on the other hand, the optimization is not defined over all distributions—a very high- or infinite-dimensional space—but rather over the much lower-dimensional space of mean parameters. As an important consequence,


solving this variational principle yields not only the cumulant generating function but also the full set of mean parameters μ = {μ_α | α ∈ I}.

7.3.1 Conjugate Duality

The cornerstone of our variational principle is the notion of conjugate duality. In this section, we provide a brief introduction to this concept, and refer the interested reader to the standard texts (Hiriart-Urruty and Lemaréchal, 1993; Rockafellar, 1970) for further details. As is standard in convex analysis, we consider extended real-valued functions, meaning that they take values in the extended real line R* := R ∪ {+∞}. Associated with any convex function f : R^d → R* is a conjugate dual function f* : R^d → R*, which is defined as follows:

f*(y) := sup_{x∈R^d} { ⟨y, x⟩ − f(x) }.   (7.24)

This definition illustrates the concept of a variational definition: the function value f*(y) is specified as the solution of an optimization problem parameterized by the vector y ∈ R^d. As illustrated in fig. 7.4, the value f*(y) has a natural geometric interpretation as the (negative) intercept of the hyperplane with normal (y, −1) that supports the epigraph of f. In particular, consider the family of hyperplanes of the form ⟨y, x⟩ − c, where y is a fixed normal direction and c ∈ R is the intercept to be adjusted. Our goal is to find the smallest c such that the resulting hyperplane supports the

Figure 7.4 Interpretation of conjugate duality in terms of supporting hyperplanes to the epigraph of f, defined as epi(f) := {(x, y) ∈ R^d × R | f(x) ≤ y}. The dual function is obtained by translating the family of hyperplanes with normal y and intercept −c until it just supports the epigraph of f (the shaded region).

epigraph of f . Note that the hyperplane y, x − c lies below the epigraph of f if and only if the inequality y, x − c ≤ f (x) holds for all x ∈ Rd . Moreover,


it can be seen that the smallest c for which this inequality is valid is given by c* = sup_{x∈R^d} { ⟨y, x⟩ − f(x) }, which is precisely the value of the dual function. As illustrated in fig. 7.4, the geometric interpretation is that of moving the hyperplane (by adjusting the intercept c) until it is just tangent to the epigraph of f. For convex functions meeting certain technical conditions, taking the dual twice recovers the original function. In analytical terms, this fact means that we can generate a variational representation for convex f in terms of its dual function as follows:

f(x) = sup_{y∈R^d} { ⟨x, y⟩ − f*(y) }.   (7.25)

Our goal in the next few sections is to apply conjugacy to the cumulant generating function A associated with an exponential family, as defined in equation 7.15. More specifically, its dual function takes the form

A*(μ) := sup_{θ∈Θ} { ⟨θ, μ⟩ − A(θ) },   (7.26)

where we have used the fact that, by definition, the function value A(θ) is finite only if θ ∈ Θ. Here μ ∈ R^d is a vector of so-called dual variables of the same dimension as θ. Our choice of notation—using μ for the dual variables—is deliberately suggestive: as we will see momentarily, these dual variables turn out to be precisely the mean parameters defined in equation 7.17.

Example 7.7 To illustrate the computation of a dual function, consider a scalar Bernoulli random variable x ∈ {0, 1}, whose distribution can be written in the exponential family form as p(x; θ) = exp{θx − A(θ)}. The cumulant generating function is given by A(θ) = log[1 + exp(θ)], and there is a single dual variable μ = E_θ[x]. Thus, the variational problem 7.26 defining A* takes the form

A*(μ) = sup_{θ∈R} { θμ − log[1 + exp(θ)] }.   (7.27)

If μ ∈ (0, 1), then taking derivatives shows that the supremum is attained at the unique θ ∈ R satisfying the well-known logistic relation θ = log[μ/(1 − μ)]. Substituting this logistic relation into equation 7.27 yields that for μ ∈ (0, 1), we have A*(μ) = μ log μ + (1 − μ) log(1 − μ). By taking limits μ → 1− and μ → 0+, it can be seen that this expression is valid for μ in the closed interval [0, 1]. Figure 7.5 illustrates the behavior of the supremum in equation 7.27 for μ ∉ [0, 1]. From our geometric interpretation of the value A*(μ) in terms of supporting hyperplanes, the dual value is +∞ if no supporting hyperplane can be found. In this particular case, the log partition function A(θ) = log[1 + exp(θ)] is bounded below by the line θ = 0. Therefore, as illustrated in fig. 7.5a, any slope μ < 0 cannot support epi A, which implies that A*(μ) = +∞. A similar picture holds for the case μ > 1, as shown in fig. 7.5b. Consequently, the dual function is equal to +∞ for μ ∉ [0, 1]. ♦
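Example 7.7 can be verified numerically. The sketch below (the function names are mine) maximizes θμ − A(θ) by a simple ternary search, exploiting the fact that the objective is concave in θ, and compares the result to the closed-form negative binary entropy.

```python
import math

# Conjugate dual of the Bernoulli cumulant generating function
# A(theta) = log(1 + exp(theta)), per example 7.7.
def A(theta):
    return math.log(1.0 + math.exp(theta))

def A_star_numeric(mu, lo=-30.0, hi=30.0, iters=200):
    # Maximize g(theta) = theta*mu - A(theta); g is concave, so ternary
    # search over a wide bracket converges to the supremum.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if m1 * mu - A(m1) < m2 * mu - A(m2):
            lo = m1
        else:
            hi = m2
    t = 0.5 * (lo + hi)        # near the logistic relation t = log(mu/(1-mu))
    return t * mu - A(t)

def A_star_closed(mu):
    # The negative binary entropy, valid for mu in (0, 1).
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)
```

For μ outside [0, 1] the objective θμ − A(θ) is unbounded above, consistent with A*(μ) = +∞ and the missing supporting hyperplane of fig. 7.5.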


Figure 7.5 Behavior of the supremum defining A*(μ) for (a) μ < 0 and (b) μ > 1. The value of the dual function corresponds to the negative intercept of the supporting hyperplane to epi A with slope μ.

As the preceding example illustrates, there are two aspects to characterizing the dual function A*: determining its domain (i.e., the set on which it takes a finite value), and specifying its precise functional form on the domain. In example 7.7, the domain of A* is simply the closed interval [0, 1], and its functional form on its domain is that of the binary entropy function. In the following two sections, we consider each of these aspects in more detail for general graphical models in exponential form.

7.3.2 Sets of Realizable Mean Parameters

For a given μ ∈ Rd, consider the optimization problem on the right-hand side of equation 7.26: since the cost function is differentiable, a first step in the solution is to take the derivative with respect to θ and set it equal to zero. Doing so yields the zero-gradient condition

μ = ∇A(θ) = Eθ[φ(x)], (7.28)

where the second equality follows from the standard properties of A given in lemma 7.1. We now need to determine the set of μ ∈ Rd for which equation 7.28 has a solution. Observe that any μ ∈ Rd satisfying this equation has a natural interpretation as a globally realizable mean parameter—i.e., a vector that can be realized by taking expectations of the suﬃcient statistic vector φ. This observation motivates deﬁning the following set:

M := { μ ∈ Rd | ∃ p(·) such that ∫ φ(x) p(x) ν(dx) = μ }, (7.29)

which corresponds to all realizable mean parameters associated with the set of sufficient statistics φ.


Example 7.8 Gaussian Mean Parameters
The Gaussian MRF, first introduced in example 7.3, provides a simple illustration of the set M. Given the sufficient statistics that define a Gaussian, the associated mean parameters are either first-order moments (e.g., μs = E[xs]), or second-order moments (e.g., μss = E[x²s] and μst = E[xs xt]). This full collection of mean parameters can be compactly represented in matrix form:

W(μ) := Eθ[(1, x)ᵀ(1, x)] =
⎡ 1    μ1    μ2   · · ·  μn  ⎤
⎢ μ1   μ11   μ12  · · ·  μ1n ⎥
⎢ μ2   μ21   μ22  · · ·  μ2n ⎥
⎢ ⋮     ⋮     ⋮     ⋱     ⋮  ⎥
⎣ μn   μn1   μn2  · · ·  μnn ⎦ .   (7.30)

The Schur complement formula (Horn and Johnson, 1985) implies that det W(μ) = det cov(x), so that a mean parameter vector μ = {μs | s ∈ V} ∪ {μst | (s, t) ∈ E} is globally realizable if and only if the matrix W(μ) is strictly positive definite. Thus, the set M is straightforward to characterize in the Gaussian case. ♦

Example 7.9 Marginal Polytopes
We now consider the case of a multinomial MRF, first introduced in example 7.4. With the choice of sufficient statistics (eq. 7.20), the associated mean parameters are simply local marginal probabilities, viz.,

μs;j := p(xs = j; θ) ∀ s ∈ V,

μst;jk := p((xs , xt ) = (j, k); θ) ∀ (s, t) ∈ E. (7.31)

In analogy to our earlier definition of θs(xs), we define functional versions of the mean parameters as follows:

μs(xs) := Σ_{j∈Xs} μs;j I_j(xs),    μst(xs, xt) := Σ_{(j,k)∈Xs×Xt} μst;jk I_jk(xs, xt). (7.32)

With this notation, the set M consists of all singleton marginals μs (as s ranges over V) and pairwise marginals μst (for edges (s, t) in the edge set E) that can be realized by a distribution with support on X^n. Since the space X^n has a finite number of elements, the set M is formed by taking the convex hull of a finite number of vectors. As a consequence, it must be a polytope, meaning that it can be described by a finite number of linear inequality constraints. In this discrete case, we refer to M as a marginal polytope, denoted by MARG(G); see fig. 7.6 for an idealized illustration. As discussed in section 7.4.2, it is straightforward to specify a set of necessary conditions, expressed in terms of local constraints, that any element of MARG(G) must satisfy. However—and in sharp contrast to the Gaussian case—characterizing the marginal polytope exactly for a general graph is intractable, since it requires in general an exponential number of linear inequality constraints. Indeed, if it were possible to characterize MARG(G) with a polynomial-sized set of constraints, then this would imply the polynomial-time solvability of various NP-complete problems (see section 7.7.2 for further discussion of this point). ♦
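To make the polytope picture concrete, the following sketch (plain Python; the two-node binary MRF and the indicator ordering are our own illustrative choices, not notation fixed by the chapter) enumerates the vertices μ_e = φ(e) for a single-edge graph and checks that an arbitrary convex combination of them satisfies the local normalization and marginalization constraints:

```python
import itertools, random

# Sufficient statistics for two binary variables joined by one edge: the
# singleton indicators and the pairwise indicators, stacked into one vector.
def phi(x1, x2):
    singles = [float(x1 == j) for j in (0, 1)] + [float(x2 == j) for j in (0, 1)]
    pairs = [float((x1, x2) == (j, k))
             for j, k in itertools.product((0, 1), repeat=2)]
    return singles + pairs

# Vertices of the marginal polytope: mu_e = phi(e), one per configuration e.
vertices = [phi(x1, x2) for x1, x2 in itertools.product((0, 1), repeat=2)]

# A generic point of the polytope: a random convex combination of the vertices.
random.seed(0)
w = [random.random() for _ in vertices]
total = sum(w)
w = [wi / total for wi in w]
mu = [sum(wi * v[i] for wi, v in zip(w, vertices)) for i in range(8)]

mu1, mu2, mu12 = mu[0:2], mu[2:4], mu[4:8]
# Local consistency: normalization and pairwise marginalization hold exactly.
assert abs(sum(mu1) - 1.0) < 1e-12 and abs(sum(mu2) - 1.0) < 1e-12
for j in (0, 1):
    assert abs(mu12[2 * j] + mu12[2 * j + 1] - mu1[j]) < 1e-12  # marginalize x2
    assert abs(mu12[j] + mu12[2 + j] - mu2[j]) < 1e-12          # marginalize x1
```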


Figure 7.6 Geometrical illustration of a marginal polytope. Each vertex corresponds to the mean parameter μe := φ(e) realized by the distribution δe(x) that puts all of its mass on the configuration e ∈ X^n. The faces of the marginal polytope are specified by hyperplane constraints ⟨aj, μ⟩ ≤ bj.

7.3.3 Entropy in Terms of Mean Parameters

We now turn to the second aspect of the characterization of the conjugate dual function A∗ —that of specifying its precise functional form on its domain M. As might be expected from our discussion of maximum entropy in section 7.2.1, the form of the dual function A∗ turns out to be closely related to entropy. Accordingly, we begin by deﬁning the entropy in a bit more generality: Given a density function p taken with respect to base measure ν, its entropy is given by

H(p) = −∫_{X^n} p(x) log p(x) ν(dx) = −Ep[log p(x)]. (7.33)

With this setup, now suppose that μ belongs to the interior of M. Under this assumption, it can be shown (Brown, 1986; Wainwright and Jordan, 2003b) that there exists a canonical parameter θ(μ) ∈ Θ such that

Eθ(μ)[φ(x)] = μ. (7.34)

Substituting this relation into the definition of the dual function (eq. 7.26) yields

A∗(μ) = ⟨μ, θ(μ)⟩ − A(θ(μ)) = Eθ(μ)[log p(x; θ(μ))],

which we recognize as the negative entropy −H(p(x; θ(μ))), where μ and θ(μ) are dually coupled via equation 7.34. Summarizing our development thus far, we have established that the dual function A∗ has the following form:

A∗(μ) = ⎧ −H(p(x; θ(μ)))   if μ belongs to the interior of M,
        ⎩ +∞               if μ is outside the closure of M.      (7.35)

An alternative way to interpret this dual function A∗ is by returning to the maximum entropy problem originally considered in section 7.2.1. More specifically,


suppose that we consider the optimal value of the maximum entropy problem given in 7.12, considered parametrically as a function of the constraints μ. Essentially, what we have established is that this optimal value function is, up to sign, the dual function—that is,

A∗(μ) = − max_{p∈P} H(p)   such that Ep[φα(x)] = μα for all α ∈ I. (7.36)

In this context, the property that A∗ (μ) = +∞ for a constraint vector μ outside of M has a concrete interpretation: it corresponds to infeasibility of the maximum entropy problem (eq. 7.12).

Exact variational principle
Given the form of the dual function (eq. 7.35), we can now use the conjugate dual relation 7.25 to express A in terms of an optimization problem involving its dual function and the mean parameters:

A(θ) = sup_{μ∈M} { ⟨θ, μ⟩ − A∗(μ) }. (7.37)

Note that the optimization is restricted to the set M of globally realizable mean parameters, since the dual function A∗ is infinite outside of this set. Thus, we have expressed the cumulant generating function as the solution of an optimization problem that is convex (since it entails maximizing a concave function over the convex set M), and low dimensional (since it is expressed in terms of the mean parameters μ ∈ Rd). In addition to representing the value A(θ) of the cumulant generating function, the variational principle 7.37 also has another important property: the nature of our dual construction ensures that the optimum is always attained at the vector of mean parameters μ = Eθ[φ(x)]. Consequently, solving this optimization problem yields both the value of the cumulant generating function and the full set of mean parameters. In this way, the variational principle 7.37 based on exponential families differs fundamentally from the classical free energy principle from statistical physics.
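The scalar Bernoulli case of example 7.7 gives a direct numerical illustration of equation 7.37. The sketch below (plain Python; the grid search is our own crude stand-in for the exact supremum) checks both claims: the optimal value recovers A(θ), and the optimum is attained at the mean parameter Eθ[φ(x)] = σ(θ):

```python
import math

def neg_entropy(mu):
    # A*(mu) on its domain [0, 1]: the negative Bernoulli entropy.
    return sum(p * math.log(p) for p in (mu, 1.0 - mu) if p > 0.0)

def A_variational(theta, steps=200000):
    # Grid stand-in for sup_{mu in [0, 1]} { theta * mu - A*(mu) }.
    best, argbest = -float("inf"), 0.0
    for i in range(steps + 1):
        mu = i / steps
        val = theta * mu - neg_entropy(mu)
        if val > best:
            best, argbest = val, mu
    return best, argbest

for theta in (-2.0, 0.0, 1.5):
    value, mu_star = A_variational(theta)
    # The optimal value recovers the log partition function A(theta) ...
    assert abs(value - math.log1p(math.exp(theta))) < 1e-6
    # ... and the optimum is attained at the mean parameter sigma(theta).
    assert abs(mu_star - 1.0 / (1.0 + math.exp(-theta))) < 1e-4
```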

7.4 Exact Inference in Variational Form

In order to illustrate the general variational principle given in equation 7.37, it is worthwhile considering important cases in which it can be solved exactly. Accordingly, this section treats in some detail the case of a Gaussian MRF on an arbitrary graph—for which we rederive the normal equations—as well as the case of a multinomial MRF on a tree, for which we sketch out a derivation of the sum-product algorithm from a variational perspective. In addition to providing a novel perspective on exact methods, the variational principle 7.37 also underlies a variety of methods for approximate inference, as we will see in section 7.5.


7.4.1 Exact Inference in Gaussian Markov Random Fields

We begin by considering the case of a Gaussian Markov random field (MRF) on an arbitrary graph, as discussed in examples 7.3 and 7.8. In particular, we showed in the latter example that the set MGauss of realizable Gaussian mean parameters μ is determined by a positive definiteness constraint on the matrix W(μ) of mean parameters defined in equation 7.30. We now consider the form of the dual function A∗(μ). It is well known (Cover and Thomas, 1991) that the entropy of a multivariate Gaussian random vector can be written as

H(p) = (1/2) log det cov(x) + (n/2) log 2πe,

where cov(x) is the n×n covariance matrix of x. By recalling the definition (eq. 7.30) of W(μ) and applying the Schur complement formula (Horn and Johnson, 1985), we see that det cov(x) = det W(μ), which implies that the dual function for a Gaussian can be written in the form

A∗Gauss(μ) = −(1/2) log det W(μ) − (n/2) log 2πe, (7.38)

valid for all μ ∈ MGauss. (To understand the negative signs, recall from equation 7.35 that A∗ is equal to the negative entropy for μ ∈ MGauss.) Combining this exact expression for A∗Gauss with our characterization of MGauss leads to

AGauss(θ) = sup_{W(μ)≻0, W11(μ)=1} { ⟨U(θ), W(μ)⟩ + (1/2) log det W(μ) + (n/2) log 2πe }, (7.39)

which corresponds to the variational principle 7.37 specialized to the Gaussian case. We now show how solving the optimization problem 7.39 leads to the normal equations for Gaussian inference. In order to do so, it is convenient to introduce the following notation for the different blocks of the matrices W(μ) and U(θ):

W(μ) = ⎡ 1     zᵀ(μ) ⎤ ,    U(θ) = ⎡ 0     zᵀ(θ) ⎤ .    (7.40)
       ⎣ z(μ)  Z(μ)  ⎦             ⎣ z(θ)  Z(θ)  ⎦

In this definition, the submatrices Z(μ) and Z(θ) are n × n, whereas z(μ) and z(θ) are n × 1 vectors. Now if W(μ) ≻ 0 were the only constraint in problem 7.39, then, using the fact that ∇ log det W = W⁻¹ for any symmetric positive definite matrix W, the optimal solution to problem 7.39 would simply be W(μ) = −2[U(θ)]⁻¹. Accordingly, if we enforce the constraint [W(μ)]11 = 1 using a Lagrange multiplier λ, then it follows from the Karush-Kuhn-Tucker conditions (Bertsekas, 1995b) that the optimal solution will assume the form W(μ) = −2[U(θ) + λ∗E11]⁻¹, where λ∗ is the optimal setting of the Lagrange multiplier and E11 is an (n + 1) × (n + 1) matrix with a one in the upper left-hand corner, and zeros in all other entries. Using the standard


formula for the inverse of a block-partitioned matrix (Horn and Johnson, 1985), it is straightforward to verify that the blocks in the optimal W(μ) are related to the blocks of U(θ) by the relations:

Z(μ) − z(μ)zᵀ(μ) = −2[Z(θ)]⁻¹,  (7.41a)
z(μ) = −[Z(θ)]⁻¹ z(θ).  (7.41b)

(The multiplier λ∗ turns out not to be involved in these particular blocks.) In order to interpret these relations, it is helpful to return to the definition of U(θ) given in equation 7.18, and the Gaussian density of equation 7.19. In this way, we see that the first part of equation 7.41 corresponds to the fact that the covariance matrix is the inverse of the precision matrix, whereas the second part corresponds to the normal equations for the mean z(μ) of a Gaussian. Thus, as a special case of the general variational principle 7.37, we have rederived the familiar equations for Gaussian inference. It is worth noting that the derivation did not exploit any particular features of the graph structure. The Gaussian case is remarkable in this regard, in that both the dual function A∗ and the set M of realizable mean parameters can be characterized simply for an arbitrary graph. However, many methods for solving the normal equations 7.41 as efficiently as possible, including Kalman filtering on trees (Willsky, 2002), make heavy use of the underlying graphical structure.
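The Gaussian characterization is easy to verify numerically. The sketch below (assuming NumPy; the random test Gaussian is an arbitrary choice of ours) builds the moment matrix W(μ) of equation 7.30 from the first and second moments of a nondegenerate Gaussian, and checks the Schur complement argument: det W(μ) = det cov(x), and W(μ) is strictly positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# An arbitrary nondegenerate Gaussian: mean m, covariance C (positive definite).
m = rng.normal(size=n)
B = rng.normal(size=(n, n))
C = B @ B.T + np.eye(n)

# Mean parameters: first moments E[x] and second moments E[x x^T].
second = C + np.outer(m, m)                    # E[x x^T] = cov(x) + E[x] E[x]^T
W = np.block([[np.ones((1, 1)), m[None, :]],   # the (n+1) x (n+1) matrix W(mu)
              [m[:, None], second]])

# Schur complement of the (1,1) entry of W recovers cov(x), so the determinants
# agree, and W(mu) inherits strict positive definiteness.
assert np.allclose(np.linalg.det(W), np.linalg.det(C))
assert np.all(np.linalg.eigvalsh(W) > 0)
```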

7.4.2 Exact Inference on Trees

We now turn to the case of tree-structured Markov random fields, focusing for concreteness on the multinomial case, first introduced in example 7.4 and treated in more depth in example 7.9. Recall from the latter example that for a multinomial MRF, the set M of realizable mean parameters corresponds to a marginal polytope, which we denote by MARG(G). There is an obvious set of local constraints that any member of MARG(G) must satisfy. For instance, given their interpretation as local marginal distributions, the vectors μs and μst must of course be nonnegative. In addition, they must satisfy the normalization conditions (i.e., Σ_{xs} μs(xs) = 1), and the pairwise marginalization conditions (i.e., Σ_{xt} μst(xs, xt) = μs(xs)). Accordingly, we define for any graph G the following constraint set:

LOCAL(G) := { μ ≥ 0 | Σ_{xs} μs(xs) = 1 ∀ s ∈ V,  Σ_{xt} μst(xs, xt) = μs(xs) ∀ (s, t) ∈ E }. (7.42)

Since any set of singleton and pairwise marginals (regardless of the underlying graph structure) must satisfy these local consistency constraints, we are guaranteed that MARG(G) ⊆ LOCAL(G) for any graph G. This fact plays a significant role in our later discussion in section 7.6 of the Bethe variational principle and sum-product on graphs with cycles. Of most importance to the current development is the following


consequence of the junction tree theorem (see section 7.1.2, subsection on junction-tree representation): when the graph G is tree-structured, then LOCAL(T) = MARG(T). Thus, the marginal polytope MARG(T) for trees has the very simple description given in 7.42. The second component of the exact variational principle 7.37 is the dual function A∗. Here the junction tree framework is useful again: in particular, specializing the representation in equation 7.7 to a tree yields the following factorization:

p(x; μ) = ∏_{s∈V} μs(xs) ∏_{(s,t)∈E} [ μst(xs, xt) / (μs(xs) μt(xt)) ] (7.43)

for a tree-structured distribution in terms of its mean parameters μs and μst. From this decomposition, it is straightforward to compute the entropy purely as a function of the mean parameters by taking the logarithm and expectations and simplifying. Doing so yields the expression

−A∗(μ) = Σ_{s∈V} Hs(μs) − Σ_{(s,t)∈E} Ist(μst), (7.44)

where the singleton entropy Hs and mutual information Ist are given by

Hs(μs) := −Σ_{xs} μs(xs) log μs(xs),
Ist(μst) := Σ_{xs,xt} μst(xs, xt) log [ μst(xs, xt) / (μs(xs) μt(xt)) ],

respectively. Putting the pieces together, the general variational principle 7.37 takes the following particular form:

A(θ) = max_{μ∈LOCAL(T)} { ⟨θ, μ⟩ + Σ_{s∈V} Hs(μs) − Σ_{(s,t)∈E} Ist(μst) }. (7.45)
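Equation 7.44 can be checked by brute force on a small tree. The sketch below (plain Python; the three-node chain and its conditional probability tables are arbitrary test choices of ours) compares the exact joint entropy with Σ_s Hs(μs) − Σ_(s,t) Ist(μst) computed from the singleton and pairwise marginals:

```python
import itertools, math

# A three-node chain x1 - x2 - x3 of binary variables, specified (arbitrarily)
# by an initial distribution and two conditional probability tables.
p1 = [0.3, 0.7]
p2_given_1 = [[0.6, 0.4], [0.2, 0.8]]
p3_given_2 = [[0.5, 0.5], [0.9, 0.1]]

def joint(x):
    x1, x2, x3 = x
    return p1[x1] * p2_given_1[x1][x2] * p3_given_2[x2][x3]

states = list(itertools.product((0, 1), repeat=3))

def marg(idxs):
    # Marginal distribution over the coordinates in idxs, by brute force.
    out = {}
    for x in states:
        key = tuple(x[i] for i in idxs)
        out[key] = out.get(key, 0.0) + joint(x)
    return out

# Exact joint entropy versus the tree decomposition of equation 7.44.
H_exact = -sum(joint(x) * math.log(joint(x)) for x in states)

H_singles = sum(-sum(p * math.log(p) for p in marg((s,)).values())
                for s in range(3))
I_edges = 0.0
for s, t in [(0, 1), (1, 2)]:
    mst, ms, mt = marg((s, t)), marg((s,)), marg((t,))
    I_edges += sum(p * math.log(p / (ms[(a,)] * mt[(b,)]))
                   for (a, b), p in mst.items())

# For a tree, the decomposition is exact.
assert abs(H_exact - (H_singles - I_edges)) < 1e-12
```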

There is an important link between this variational principle for multinomial MRFs on trees, and the sum-product updates (eq. 7.6). In particular, the sum-product updates can be derived as an iterative algorithm for solving a Lagrangian dual formulation of the problem 7.45. This will be clariﬁed in our discussion of the Bethe variational principle in section 7.6.

7.5 Approximate Inference in Variational Form

Thus far, we have seen how well-known methods for exact inference—specifically, the computation of means and covariances in the Gaussian case and the computation of local marginal distributions by the sum-product algorithm for tree-structured problems—can be rederived from the general variational principle (eq. 7.37). It is worthwhile isolating the properties that permit an exact solution of the variational principle. First, for both of the preceding cases, it is possible to characterize the set M of globally realizable mean parameters in a straightforward manner. Second, the entropy can be expressed as a closed-form function of the mean parameters μ, so that the dual function A∗(μ) has an explicit form. Neither of these two properties holds for a general graphical model in exponential form. As a consequence, there are significant challenges associated with exploiting the variational representation. More precisely, in contrast to the simple cases discussed thus far, many graphical models of interest have the following properties:

1. The constraint set M of realizable mean parameters is extremely difficult to characterize in an explicit manner.
2. The negative entropy function A∗ is defined indirectly—in a variational manner—so that it too typically lacks an explicit form.

These difficulties motivate the use of approximations to M and A∗. Indeed, a broad class of methods for approximate inference—ranging from mean field theory to cluster variational methods—are based on this strategy. Accordingly, the remainder of the chapter is devoted to a discussion of approximate methods based on relaxations of the exact variational principle.

7.5.1 Mean Field Theory

We begin our discussion of approximate algorithms with mean field methods, a set of algorithms with roots in statistical physics (Chandler, 1987). Working from the variational principle 7.37, we show that mean field methods can be understood as solving an approximation thereof, with the essential restriction that the optimization is limited to a subset of distributions for which the dual function A∗ is relatively easy to characterize. Throughout this section, we will refer to a distribution with this property as a tractable distribution.

Tractable Families
Let H represent a subgraph of G over which it is feasible to perform exact calculations (e.g., a graph with small treewidth); we refer to any such H as a tractable subgraph. In an exponential formulation, the set of all distributions that respect the structure of H can be represented by a linear subspace of canonical parameters. More specifically, letting I(H) denote the subset of indices associated with cliques in H, the set of canonical parameters corresponding to distributions structured according to H is given by

E(H) := {θ ∈ Θ | θα = 0 ∀ α ∈ I\I(H)}. (7.46)

We consider some examples to illustrate:

Example 7.10 Tractable Subgraphs
The simplest instance of a tractable subgraph is the completely disconnected graph H0 = (V, ∅) (see fig. 7.7b). Permissible parameters belong to the subspace E(H0) := {θ ∈ Θ | θst = 0 ∀ (s, t) ∈ E}, where θst refers to the collection of canonical parameters associated with edge (s, t). The associated distributions are

182

A Variational Principle for Graphical Models

of the product form p(x; θ) = ∏_{s∈V} p(xs; θs), where θs refers to the collection of canonical parameters associated with vertex s. To obtain a more structured approximation, one could choose a spanning tree T = (V, E(T)), as illustrated in fig. 7.7c. In this case, we are free to choose the canonical parameters corresponding to vertices and edges in T, but we must set to zero any canonical parameters corresponding to edges not in the tree. Accordingly, the subspace of tree-structured distributions is given by E(T) = {θ | θst = 0 ∀ (s, t) ∉ E(T)}. ♦

For a given subgraph H, consider the set of all possible mean parameters that are realizable by tractable distributions:

Mtract(G; H) := {μ ∈ Rd | μ = Eθ[φ(x)] for some θ ∈ E(H)}. (7.47)

The notation Mtract(G; H) indicates that mean parameters in this set arise from taking expectations of sufficient statistics associated with the graph G, but that they must be realizable by a tractable distribution—i.e., one that respects the structure of H. See example 7.11 for an explicit illustration of this set when the tractable subgraph H is the fully disconnected graph. Since any μ that arises from a tractable distribution is certainly a valid mean parameter, the inclusion Mtract(G; H) ⊆ M(G) always holds. In this sense, Mtract is an inner approximation to the set M of realizable mean parameters.

Optimization and Lower Bounds
We now have the necessary ingredients to develop the mean field approach to approximate inference. Let p(x; θ) denote the target distribution that we are interested in approximating. The basis of the mean field method is the following fact: any valid mean parameter specifies a lower bound on the cumulant generating function. Indeed, as an immediate consequence of the variational principle 7.37, we have

A(θ) ≥ ⟨θ, μ⟩ − A∗(μ) (7.48)

for any μ ∈ M. This inequality can also be established by applying Jensen's inequality (Jordan et al., 1999). Since the dual function A∗ typically lacks an explicit form, it is not possible, at least in general, to compute the lower bound (eq. 7.48). The mean field approach circumvents this difficulty by restricting the choice of μ to the tractable subset Mtract(G; H), for which the dual function has an explicit form A∗H. As long as μ belongs to Mtract(G; H), the lower bound 7.48 will be computable. Of course, for a nontrivial class of tractable distributions, there are many such bounds. The goal of the mean field method is the natural one: find the best approximation μMF, as measured in terms of the tightness of the bound. This optimal approximation is specified as the solution of the optimization problem

sup_{μ∈Mtract(G;H)} { ⟨μ, θ⟩ − A∗H(μ) }, (7.49)


Figure 7.7 Graphical illustration of the mean field approximation. (a) Original graph is a 7 × 7 grid. (b) Fully disconnected graph, corresponding to a naive mean field approximation. (c) A more structured approximation based on a spanning tree.

which is a relaxation of the exact variational principle 7.37. The optimal value speciﬁes a lower bound on A(θ), and it is (by deﬁnition) the best one that can be obtained by using a distribution from the tractable class. An important alternative interpretation of the mean ﬁeld approach is in terms of minimizing the Kullback-Leibler (KL) divergence between the approximating (tractable) distribution and the target distribution. Given two densities p and q, the KL divergence is given by

D(p‖q) = ∫_{X^n} p(x) log [p(x)/q(x)] ν(dx). (7.50)

To see the link to our derivation of mean field, consider, for a given mean parameter μ ∈ Mtract(G; H), the difference between the log partition function A(θ) and the quantity ⟨μ, θ⟩ − A∗H(μ):

D(μ‖θ) = A(θ) + A∗H(μ) − ⟨μ, θ⟩.

A bit of algebra shows that this difference is equal to the KL divergence 7.50 with q = p(x; θ) and p = p(x; μ) (i.e., the exponential family member with mean parameter μ). Therefore, solving the mean field variational problem 7.49 is equivalent to minimizing the KL divergence subject to the constraint that μ belongs to the tractable set of mean parameters, or equivalently that p is a tractable distribution.

7.5.2 Naive Mean Field Updates

The naive mean ﬁeld (MF) approach corresponds to choosing a fully factorized or product distribution in order to approximate the original distribution. The naive mean ﬁeld updates are a particular set of recursions for ﬁnding a stationary point of the resulting optimization problem.


Example 7.11
As an illustration, we derive the naive mean field updates for the Ising model, which is a special case of the multinomial MRF defined in example 7.4. It involves binary variables, so that Xs = {0, 1} for all vertices s ∈ V. Moreover, the canonical parameters are of the form θs(xs) = θs xs and θst(xs, xt) = θst xs xt for real numbers θs and θst. Consequently, the exponential representation of the Ising model has the form

p(x; θ) ∝ exp{ Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt }.

Letting H0 denote the fully disconnected graph (i.e., without any edges), the tractable set Mtract(G; H0) consists of all mean parameters {μs, μst} that arise from a product distribution. Explicitly, in this binary case, we have Mtract(G; H0) := {(μs, μst) | 0 ≤ μs ≤ 1, μst = μs μt}. Moreover, the negative entropy of a product distribution over binary random variables decomposes into the sum A∗H0(μ) = Σ_{s∈V} [μs log μs + (1 − μs) log(1 − μs)]. Accordingly, the associated naive mean field problem takes the form

max_{μ∈Mtract(G;H0)} { ⟨μ, θ⟩ − A∗H0(μ) }.

In this particular case, it is convenient to eliminate μst by replacing it with the product μs μt. Doing so leads to a reduced form of the problem:

max_{{μs}∈[0,1]^n} { Σ_{s∈V} θs μs + Σ_{(s,t)∈E} θst μs μt − Σ_{s∈V} [μs log μs + (1 − μs) log(1 − μs)] }. (7.51)

Let F denote the function of μ within curly braces in equation 7.51. It can be seen that the function F is strictly concave in a given coordinate μs when all the other coordinates are held fixed. Moreover, it is straightforward to show that the maximum over μs, with μt for t ≠ s held fixed, is attained in the interior (0, 1), and can be found by taking the gradient and setting it equal to zero. Doing so yields the following update for μs:

μs ← σ( θs + Σ_{t∈N(s)} θst μt ), (7.52)

where σ(z) := [1 + exp(−z)]⁻¹ is the logistic function. Applying equation 7.52 iteratively to each node in succession amounts to performing coordinate ascent on the objective function of the mean field variational problem 7.51. Thus, we have derived the update equation presented earlier in equation 7.10. ♦

Similarly, it is straightforward to apply the naive mean field approximation to other types of graphical models, as we illustrate for a multivariate Gaussian.
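Before turning to the Gaussian case, the Ising updates can be checked end to end in a short sketch (plain Python; the 4-cycle graph and parameter values are arbitrary test choices of ours): after iterating equation 7.52 to convergence, the resulting mean field objective is verified to lower-bound the exact log partition function, as equation 7.48 requires:

```python
import itertools, math

# A small Ising model on a 4-cycle; the parameter values are arbitrary.
V = range(4)
E = [(0, 1), (1, 2), (2, 3), (0, 3)]
theta_s = {s: 0.2 for s in V}
theta_st = {e: 0.5 for e in E}

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def neighbors(s):
    return [b if a == s else a for (a, b) in E if s in (a, b)]

# Coordinate-ascent naive mean field updates (equation 7.52).
mu = {s: 0.5 for s in V}
for _ in range(200):
    for s in V:
        mu[s] = sigma(theta_s[s]
                      + sum(theta_st[tuple(sorted((s, t)))] * mu[t]
                            for t in neighbors(s)))

# Mean field objective <mu, theta> - A*_{H0}(mu) at the fixed point ...
def H_bern(p):
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

mf_value = (sum(theta_s[s] * mu[s] for s in V)
            + sum(theta_st[e] * mu[e[0]] * mu[e[1]] for e in E)
            + sum(H_bern(mu[s]) for s in V))

# ... lower-bounds the exact log partition function, by brute-force enumeration.
A_exact = math.log(sum(
    math.exp(sum(theta_s[s] * x[s] for s in V)
             + sum(theta_st[e] * x[e[0]] * x[e[1]] for e in E))
    for x in itertools.product((0, 1), repeat=4)))

assert mf_value <= A_exact + 1e-12
```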


Example 7.12 Gaussian Mean Field
The mean parameters for a multivariate Gaussian are of the form μs = E[xs], μss = E[x²s], and μst = E[xs xt] for s ≠ t. Using only Gaussians in product form, the set of tractable mean parameters takes the form Mtract(G; H0) = {μ ∈ Rd | μst = μs μt ∀ s ≠ t, μss − μ²s > 0}. As with naive mean field on the Ising model, the constraints μst = μs μt for s ≠ t can be imposed directly, thereby leaving only the inequality μss − μ²s > 0 for each node. The negative entropy of a Gaussian in product form can be written as A∗Gauss(μ) = −Σ_{s=1}^n (1/2) log(μss − μ²s) − (n/2) log 2πe. Combining A∗Gauss with the constraints leads to the naive MF problem for a multivariate Gaussian:

sup_{(μs,μss): μss−μ²s>0} { ⟨U(θ), W(μ)⟩ + Σ_{s=1}^n (1/2) log(μss − μ²s) + (n/2) log 2πe },

where the matrices U(θ) and W(μ) are defined in equation 7.40. Here it should be understood that any terms μst, s ≠ t, contained in W(μ) are replaced with the product μs μt. Taking derivatives with respect to μss and μs and rearranging yields the stationary conditions

1/(2(μss − μ²s)) = −θss    and    μs/(2(μss − μ²s)) = θs + Σ_{t∈N(s)} θst μt.

Since θss < 0, we can combine both equations into the update

μs ← −(1/θss) ( θs + Σ_{t∈N(s)} θst μt ).

In fact, the resulting algorithm is equivalent to the Gauss-Jacobi method for solving the normal equations, and so is guaranteed to converge under suitable conditions (Demmel, 1997), in which case the algorithm computes the correct mean vector [μ1 . . . μn]. ♦
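The fixed point of the Gaussian update can be checked against the exact mean. The sketch below (assuming NumPy) runs a Jacobi-style sweep of the same functional form, written directly in terms of a precision matrix Λ and potential vector h so as not to fix the chapter's particular parameterization of θ, and confirms convergence to Λ⁻¹h for a diagonally dominant Λ:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
B = rng.normal(size=(n, n))
Lam = B @ B.T
Lam += np.abs(Lam).sum() * np.eye(n)   # strongly diagonally dominant precision
h = rng.normal(size=n)
d = np.diag(Lam).copy()

# Jacobi sweep: each coordinate of the mean is re-solved with the others held
# fixed, matching the form mu_s <- -(1/theta_ss)(theta_s + sum_t theta_st mu_t).
m = np.zeros(n)
for _ in range(200):
    m = (h - Lam @ m + d * m) / d

# The iteration recovers the exact Gaussian mean Lambda^{-1} h.
assert np.allclose(m, np.linalg.solve(Lam, h), atol=1e-8)
```

Diagonal dominance is the classical sufficient condition for Jacobi-type iterations to converge, which is why the test matrix is constructed that way.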

7.5.3 Structured Mean Field and Other Extensions

Of course, the essential principles underlying the mean field approach are not limited to fully factorized distributions. More generally, one can consider classes of tractable distributions that incorporate additional structure. This structured mean field approach was first proposed by Saul and Jordan (1996), and further developed by various researchers. In this section, we discuss only one particular example in order to illustrate the basic idea, and refer the interested reader elsewhere (Wainwright and Jordan, 2003b; Wiegerinck, 2000) for further details.

Example 7.13 Structured Mean Field for Factorial Hidden Markov Models
The factorial hidden Markov model, as described in Ghahramani and Jordan (1997), has the form shown in fig. 7.8a. It consists of a set of M Markov chains (M = 3 in this diagram), which share at each time a common observation (shaded nodes). Such models are useful, for example, in modeling the joint dependencies between speech and video signals over time. Although the separate chains are independent a priori, the common observation induces an effective coupling between all nodes at each time (a coupling which is


captured by the moralization process mentioned earlier). Thus, an equivalent model is shown in fig. 7.8b, where the dotted ellipses represent the induced coupling of each observation.

Figure 7.8 Structured mean ﬁeld approximation for a factorial HMM. (a) Original model consists of a set of hidden Markov models (deﬁned on chains), coupled at each time by a common observation. (b) An equivalent model, where the ellipses represent interactions among all nodes at a ﬁxed time, induced by the common observation. (c) Approximating distribution formed by a product of chainstructured models. Here μα and μδ are the sets of mean parameters associated with the indicated vertex and edge respectively.

A natural choice of approximating distribution in this case is based on the subgraph H consisting of the decoupled set of M chains, as illustrated in ﬁg. 7.8c. The decoupled nature of the approximation yields valuable savings on the computational side. In particular, it can be shown (Saul and Jordan, 1996; Wainwright and Jordan, 2003b) that all intermediate quantities necessary for implementing the structured mean ﬁeld updates can be calculated by applying the forward-backward algorithm (i.e., the sum-product updates as an exact method) to each chain separately. ♦ In addition to structured mean ﬁeld, there are various other extensions to naive mean ﬁeld, which we mention only in passing here. A large class of techniques, including linear response theory and the TAP method (Kappen and Rodriguez, 1998; Opper and Saad, 2001; Plefka, 1982), seek to improve the mean ﬁeld approximation by introducing higher-order correction terms. Although the lower bound on the log partition function is not usually preserved by these higher-order methods, Leisink and Kappen (2001) demonstrated how to generate tighter lower bounds based on higher-order expansions. 7.5.4

Geometric View of Mean Field

An important fact about the mean ﬁeld approach is that the variational problem (eq. 7.49) may be nonconvex, so that there may be local minima, and the mean ﬁeld updates can have multiple solutions.


One way to understand this nonconvexity is in terms of the set of tractable mean parameters: under fairly mild conditions, it can be shown (Wainwright and Jordan, 2003b) that the set Mtract (G; H) is nonconvex. Figure 7.9 provides a geometric illustration for the case of a multinomial MRF, for which the set M is a marginal polytope.


Figure 7.9 The set Mtract(G; H) of mean parameters that arise from tractable distributions is a nonconvex inner bound on M(G). Illustrated here is the multinomial case, where M(G) ≡ MARG(G) is a polytope. The circles correspond to mean parameters that arise from delta distributions with all their mass on a single configuration, and belong to both M(G) and Mtract(G; H).

A practical consequence of this nonconvexity is that the mean field updates are often sensitive to the initial conditions. Moreover, the mean field method can exhibit spontaneous symmetry breaking, wherein the mean field approximation is asymmetric even though the original problem is perfectly symmetric; see Jaakkola (2001) for an illustration of this phenomenon. Despite this nonconvexity, the mean field approximation becomes exact for certain types of models as the number of nodes n grows to infinity (Baxter, 1982).

7.5.5 Parameter Estimation and Variational Expectation Maximization

Mean field methods also play an important role in the problem of parameter estimation, in which the goal is to estimate model parameters on the basis of partial observations. The expectation-maximization (EM) algorithm (Dempster et al., 1977) provides a general approach to maximum likelihood parameter estimation in the case in which some subset of variables is observed whereas others are unobserved. Although the EM algorithm is often presented as an alternation between an expectation step (E step) and a maximization step (M step), it is also possible to take a variational perspective on EM, and view both steps as maximization steps (Csiszár and Tusnády, 1984; Neal and Hinton, 1999). More concretely, in the exponential family setting, the E step reduces to the computation of expected sufficient statistics—i.e., mean parameters. As we have seen, the variational framework provides a general class of methods for computing approximations of mean parameters. This observation suggests a general class of variational EM algorithms, in which the approximation provided by a variational inference algorithm is substituted for the mean parameters in the E step. In general, as a consequence of making such a substitution, one loses the guarantees that are associated with the EM algorithm. In the specific case of mean field algorithms, however, a convergence guarantee is retained: in particular, the algorithm will converge to a stationary point of a lower bound on the likelihood function (Wainwright and Jordan, 2003b).
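The coordinate-ascent view of EM can be illustrated on a model where the E step is tractable, so that both steps visibly maximize the same functional. The following sketch (a toy two-component Gaussian mixture with known unit variances; all values illustrative) tracks the free energy F(q, θ) = E_q[log p(x, z; θ)] + H(q), which is nondecreasing across the alternating updates and lower-bounds the log-likelihood.

```python
import math
import random

random.seed(0)
# synthetic 1-D data from two well-separated unit-variance Gaussians
data = [random.gauss(-2.0, 1.0) for _ in range(60)] + \
       [random.gauss(3.0, 1.0) for _ in range(60)]

def log_norm(x, m):
    # log density of N(m, 1) at x
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - m) ** 2

def free_energy(q, means, pis):
    # F(q, theta) = E_q[log p(x, z; theta)] + H(q), a lower bound on the log-likelihood
    F = 0.0
    for x, qi in zip(data, q):
        for k in range(2):
            if qi[k] > 0:
                F += qi[k] * (math.log(pis[k]) + log_norm(x, means[k]) - math.log(qi[k]))
    return F

def log_likelihood(means, pis):
    return sum(math.log(sum(pis[k] * math.exp(log_norm(x, means[k])) for k in range(2)))
               for x in data)

means, pis = [-1.0, 1.0], [0.5, 0.5]
history = []
for _ in range(20):
    # E step: choose q to maximize F with theta fixed (here the exact posterior)
    q = []
    for x in data:
        w = [pis[k] * math.exp(log_norm(x, means[k])) for k in range(2)]
        tot = sum(w)
        q.append([wk / tot for wk in w])
    # M step: choose theta to maximize F with q fixed
    Nk = [sum(qi[k] for qi in q) for k in range(2)]
    means = [sum(qi[k] * x for x, qi in zip(data, q)) / Nk[k] for k in range(2)]
    pis = [Nk[k] / len(data) for k in range(2)]
    history.append(free_energy(q, means, pis))
print(history[0], history[-1])  # nondecreasing sequence
```

In a variational EM algorithm, the exact E step above would be replaced by an approximate maximization, for example a mean field update; with mean field, the resulting procedure still ascends a lower bound.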

7.6 The Bethe Entropy Approximation and the Sum-Product Algorithm

In this section, we turn to another important message-passing algorithm for approximate inference, known either as belief propagation or the sum-product algorithm. In section 7.4.2, we described the use of the sum-product algorithm for trees, in which context it is guaranteed to converge and to perform exact inference. When the same message-passing updates are applied to graphs with cycles, in contrast, there are no such guarantees; nonetheless, this "loopy" form of the sum-product algorithm is widely used to compute approximate marginals in various signal-processing applications, including phase unwrapping (Frey et al., 2001), low-level vision (Freeman et al., 2000), and channel decoding (Richardson and Urbanke, 2001).

The main idea of this section is the connection between the sum-product updates and the Bethe variational principle. The presentation given here differs from the original work of Yedidia et al. (2001), in that we formulate the problem purely in terms of mean parameters and marginal polytopes. This perspective highlights a key point: mean field and sum-product, though similar as message-passing algorithms, are fundamentally different at the variational level. In particular, whereas the essence of mean field is to restrict the optimization to a limited class of distributions for which the negative entropy and mean parameters can be characterized exactly, the sum-product algorithm is instead based on enlarging the constraint set and approximating the entropy function.

The standard Bethe approximation applies to an undirected graphical model with potential functions involving at most pairs of variables, which we refer to as a pairwise Markov random field.
In principle, by selectively introducing auxiliary variables, any undirected graphical model can be converted into an equivalent pairwise form to which the Bethe approximation can be applied; see Weiss and Freeman (2000) for a detailed description of this procedure. Moreover, although the Bethe approximation can be developed more generally, we limit our discussion to a multinomial MRF, as discussed earlier in examples 7.4 and 7.9. We also make use of the local marginal functions μs(xs) and μst(xs, xt), as defined in equation 7.32. As discussed in example 7.9, the set M associated with a multinomial MRF is the marginal polytope MARG(G). Recall that there are two components to the general variational principle 7.37: the set of realizable mean parameters (given by a marginal polytope in this case),


and the dual function A∗. Developing an approximation to the general principle requires approximations to both of these components, which we discuss in turn in the following sections.

7.6.1 Bethe Entropy Approximation

From equation 7.35, recall that the dual function A∗ corresponds to the maximum entropy distribution consistent with a given set of mean parameters; as such, it typically lacks a closed-form expression. An important exception to this general rule is the case of a tree-structured distribution: as discussed in section 7.4.2, the function A∗ for a tree-structured distribution has a closed-form expression that is straightforward to compute; see, in particular, equation 7.44. Of course, the entropy of a distribution defined by a graph with cycles will not, in general, decompose additively like that of a tree. Nonetheless, one can imagine using the decomposition in equation 7.44 as an approximation to the entropy. Doing so yields an expression known as the Bethe approximation to the entropy on a graph with cycles:

$$H_{\mathrm{Bethe}}(\mu) := \sum_{s \in V} H_s(\mu_s) \;-\; \sum_{(s,t) \in E} I_{st}(\mu_{st}). \qquad (7.53)$$

To be clear, the quantity H_Bethe(μ) is an approximation to the negative dual function −A∗(μ). Moreover, our development in section 7.4.2 shows that this approximation is exact when the graph is tree-structured. An alternative form of the Bethe entropy approximation can be derived by writing the mutual information in terms of entropies as $I_{st}(\mu_{st}) = H_s(\mu_s) + H_t(\mu_t) - H_{st}(\mu_{st})$. In particular, expanding the mutual information terms in this way, and then collecting all the single-node entropy terms, yields

$$H_{\mathrm{Bethe}}(\mu) = \sum_{s \in V} (1 - d_s)\, H_s(\mu_s) + \sum_{(s,t) \in E} H_{st}(\mu_{st}),$$

where d_s denotes the number of neighbors of node s. This representation is the form of the Bethe entropy introduced by Yedidia et al. (2001); however, the form given in equation 7.53 turns out to be more convenient for our purposes.
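The equivalence of the two forms of the Bethe entropy, and their exactness on trees, can be verified numerically. The following sketch (arbitrary parameters on a three-node chain) computes both expressions from exact marginals and compares them with the exact entropy obtained by enumeration.

```python
import itertools
import math

# pairwise binary MRF on the chain 0 - 1 - 2 (a tree); parameter values arbitrary
TH = {(0, 1): 0.9, (1, 2): -1.1}
n = 3

def score(x):
    return math.exp(sum(w * x[s] * x[t] for (s, t), w in TH.items()))

configs = list(itertools.product([0, 1], repeat=n))
Z = sum(score(x) for x in configs)
joint = {x: score(x) / Z for x in configs}

def entropy(probs):
    return -sum(q * math.log(q) for q in probs if q > 0)

# exact singleton and pairwise marginals by enumeration
mu_s = [[sum(pr for x, pr in joint.items() if x[s] == a) for a in (0, 1)]
        for s in range(n)]
mu_st = {e: [sum(pr for x, pr in joint.items() if (x[e[0]], x[e[1]]) == (a, b))
             for a in (0, 1) for b in (0, 1)] for e in TH}

Hs = [entropy(m) for m in mu_s]
Hst = {e: entropy(m) for e, m in mu_st.items()}
Ist = {(s, t): Hs[s] + Hs[t] - Hst[(s, t)] for (s, t) in TH}

# form (7.53): node entropies minus edgewise mutual informations
bethe_a = sum(Hs) - sum(Ist.values())
# Yedidia et al. form: (1 - d_s)-weighted node entropies plus edge entropies
deg = [sum(1 for e in TH if s in e) for s in range(n)]
bethe_b = sum((1 - deg[s]) * Hs[s] for s in range(n)) + sum(Hst.values())
exact = entropy(list(joint.values()))
print(bethe_a, bethe_b, exact)  # all three agree on a tree
```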

7.6.2 Tree-Based Outer Bound

Note that the Bethe entropy approximation H_Bethe is certainly well defined for any μ ∈ MARG(G). However, as discussed earlier, characterizing this polytope of realizable marginals is a very challenging problem. Accordingly, a natural approach is to specify a subset of necessary constraints, which leads to an outer bound on MARG(G). Let τ_s(x_s) and τ_st(x_s, x_t) be a set of candidate marginal distributions. In section 7.4.2, we considered the following constraint set:

$$\mathrm{LOCAL}(G) = \Big\{ \tau \ge 0 \;\Big|\; \sum_{x_s} \tau_s(x_s) = 1, \;\; \sum_{x_s} \tau_{st}(x_s, x_t) = \tau_t(x_t) \Big\}. \qquad (7.54)$$

Although LOCAL(G) is an exact description of the marginal polytope for a tree-structured graph, it is only an outer bound for graphs with cycles. (We demonstrate this fact more concretely in example 7.14.) For this reason, our change in notation—i.e., from μ to τ—is quite deliberate, with the goal of emphasizing that members τ of LOCAL(G) need not be realizable. We refer to members of LOCAL(G) as pseudomarginals (these are sometimes referred to as beliefs).

Example 7.14 Pseudomarginals
We illustrate using a binary random vector on the simplest possible graph for which LOCAL(G) is not an exact description of MARG(G)—namely, a single cycle with three nodes. Consider candidate marginal distributions {τ_s, τ_st} of the form

$$\tau_s := \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}, \qquad \tau_{st} := \begin{bmatrix} \beta_{st} & 0.5 - \beta_{st} \\ 0.5 - \beta_{st} & \beta_{st} \end{bmatrix}, \qquad (7.55)$$

where β_st ∈ [0, 0.5] is a parameter to be specified independently for each edge (s, t). It is straightforward to verify that {τ_s, τ_st} belong to LOCAL(G) for any choice of β_st ∈ [0, 0.5].

First, consider the setting β_st = 0.4 for all edges (s, t), as illustrated in fig. 7.10a. It is not difficult to show that the marginals thus defined are realizable; in fact, they can be obtained from the distribution that places probability 0.35 on each of the configurations [0 0 0] and [1 1 1], and probability 0.05 on each of the remaining six configurations. Now suppose that we perturb one of the pairwise marginals—say τ13—by setting β13 = 0.1. The resulting problem is illustrated in fig. 7.10b. Observe that there are now strong (positive) dependencies between the pairs of variables (x1, x2) and (x2, x3): both pairs are quite likely to agree (with probability 0.8). In contrast, the pair (x1, x3) can only share the same value relatively infrequently (with probability 0.2). This arrangement should provoke some doubt. Indeed, it can be shown that τ ∉ MARG(G), either by attempting but failing to construct a distribution that realizes τ, or, alternatively and much more directly, by using the idea of semidefinite constraints (see example 7.15). ♦

Figure 7.10  (a) Setting β_st = 0.4 for all three edges of a single cycle on three nodes gives a globally consistent set of marginals. (b) With β13 perturbed to 0.1, the marginals (though locally consistent) are no longer globally so. (c) For a more general graph, an idealized illustration of the tree-based constraint set LOCAL(G) as an outer bound on the marginal polytope MARG(G).

More generally, figure 7.10c provides an idealized illustration of the constraint set LOCAL(G), and its relation to the exact marginal polytope MARG(G). Observe that the set LOCAL(G) is another polytope that is a convex outer approximation to MARG(G). It is worthwhile contrasting this with the nonconvex inner approximation used by a mean field approximation, as illustrated in fig. 7.9.
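The claims in example 7.14 are easy to check numerically. The following sketch verifies that the pseudomarginals of equation 7.55 satisfy the LOCAL(G) constraints, that the β_st = 0.4 setting is realized by the distribution given above, and that the perturbed setting β13 = 0.1 remains locally consistent.

```python
import itertools

edges = [(0, 1), (1, 2), (0, 2)]   # single cycle on three nodes
beta = {e: 0.4 for e in edges}     # the globally consistent setting

def pseudo(beta):
    # pseudomarginals of the form in equation (7.55)
    tau_s = [[0.5, 0.5] for _ in range(3)]
    tau_st = {e: [[b, 0.5 - b], [0.5 - b, b]] for e, b in beta.items()}
    return tau_s, tau_st

def in_local(tau_s, tau_st):
    # normalization and pairwise marginalization conditions defining LOCAL(G)
    ok = all(abs(sum(t) - 1.0) < 1e-12 for t in tau_s)
    for (s, t), m in tau_st.items():
        ok &= all(abs(sum(m[a][b] for b in (0, 1)) - tau_s[s][a]) < 1e-12 for a in (0, 1))
        ok &= all(abs(sum(m[a][b] for a in (0, 1)) - tau_s[t][b]) < 1e-12 for b in (0, 1))
    return ok

consistent = in_local(*pseudo(beta))

# the beta = 0.4 marginals are realized by the distribution placing mass
# 0.35 on [0 0 0] and [1 1 1], and 0.05 on the remaining six configurations
p = {x: (0.35 if x in ((0, 0, 0), (1, 1, 1)) else 0.05)
     for x in itertools.product([0, 1], repeat=3)}
tau_s, tau_st = pseudo(beta)
realized = all(
    abs(sum(pr for x, pr in p.items() if (x[s], x[t]) == (a, b)) -
        tau_st[(s, t)][a][b]) < 1e-12
    for (s, t) in edges for a in (0, 1) for b in (0, 1))

# perturbing beta_13 to 0.1 keeps local consistency, though (as shown in the
# text) it breaks global realizability
beta[(0, 2)] = 0.1
still_local = in_local(*pseudo(beta))
print(consistent, realized, still_local)
```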

7.6.3 Bethe Variational Problem and Sum-Product

Note that the Bethe entropy is also well defined for any pseudomarginal in LOCAL(G). Therefore, it is valid to consider a constrained optimization problem over the set LOCAL(G) in which the cost function involves the Bethe entropy approximation H_Bethe. Indeed, doing so leads to the so-called Bethe variational problem:

$$\max_{\tau \in \mathrm{LOCAL}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \Big\}. \qquad (7.56)$$

Although ostensibly similar to a (structured) mean field approach, the Bethe variational problem (BVP) is fundamentally different in a number of ways. First, as discussed in section 7.5.1, a mean field method is based on an exact representation of the entropy, albeit over a limited class of distributions. In contrast, with the exception of tree-structured graphs, the Bethe entropy is a bona fide approximation to the entropy. For instance, it is not difficult to see that it can be negative, which of course can never happen for an exact entropy. Second, the mean field approach entails optimizing over an inner bound on the marginal polytope, which ensures that any mean field solution is always globally consistent with respect to at least one distribution, and that it yields a lower bound on the log partition function. In contrast, since LOCAL(G) is a strict outer bound on the set of realizable marginals MARG(G), the optimizing pseudomarginals τ∗ of the BVP may not be globally consistent with any distribution.

7.6.4 Solving the Bethe Variational Problem

Having formulated the Bethe variational problem, we now consider iterative methods for solving it. Observe that the set LOCAL(G) is a polytope defined by O(n + |E|) constraints. A natural approach to solving the BVP, then, is to attach Lagrange multipliers to these constraints, and find stationary points of the Lagrangian. A remarkable fact, established by Yedidia et al. (2001), is that the sum-product updates of equation 7.6 can be rederived as a method for trying to find such Lagrangian stationary points.


A bit more formally, for each $x_s \in \mathcal{X}_s$, let $\lambda_{ts}(x_s)$ be a Lagrange multiplier associated with the constraint $C_{ts}(x_s) = 0$, where $C_{ts}(x_s) := \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t)$. Our approach is to consider the following partial Lagrangian corresponding to the Bethe variational problem 7.56:

$$L(\tau; \lambda) := \langle \theta, \tau \rangle + H_{\mathrm{Bethe}}(\tau) + \sum_{(s,t) \in E} \Big[ \sum_{x_s} \lambda_{ts}(x_s)\, C_{ts}(x_s) + \sum_{x_t} \lambda_{st}(x_t)\, C_{st}(x_t) \Big].$$

The key insight of Yedidia et al. (2001) is that any fixed point of the sum-product updates specifies a pair (τ∗, λ∗) such that

$$\nabla_{\tau} L(\tau^*; \lambda^*) = 0, \qquad \nabla_{\lambda} L(\tau^*; \lambda^*) = 0. \qquad (7.57)$$

In particular, the Lagrange multipliers can be used to specify messages of the form $M_{ts}(x_s) = \exp(\lambda_{ts}(x_s))$. After taking derivatives of the Lagrangian and equating them to zero, some algebra then yields the familiar message-update rule:

$$M_{ts}(x_s) = \kappa \sum_{x_t} \exp\!\Big( \theta_{st}(x_s, x_t) + \theta_t(x_t) \Big) \prod_{u \in N(t) \setminus s} M_{ut}(x_t). \qquad (7.58)$$

We refer the reader to Yedidia et al. (2001) or Wainwright and Jordan (2003b) for further details of this derivation. By construction, any fixed point M∗ of these updates specifies a pair (τ∗, λ∗) that satisfies the stationarity conditions given in equation 7.57.

This variational formulation of the sum-product updates—namely, as an algorithm for solving a constrained optimization problem—has a number of important consequences. First of all, it can be used to guarantee the existence of sum-product fixed points. Observe that the cost function in the Bethe variational problem 7.56 is continuous and bounded above, and the constraint set LOCAL(G) is nonempty and compact; therefore, at least some (possibly local) maximum is attained. Moreover, since the constraints are linear, there will always be a set of Lagrange multipliers associated with any local maximum (Bertsekas, 1995b). For any optimum in the relative interior of LOCAL(G), these Lagrange multipliers can be used to construct a fixed point of the sum-product updates.

For graphs with cycles, this Lagrangian formulation provides no guarantees on the convergence of the sum-product updates; indeed, whether or not the algorithm converges depends on both the potential strengths and the topology of the graph. Several researchers (Heskes et al.; Welling and Teh, 2001; Yuille, 2002) have proposed alternatives to sum-product that are guaranteed to converge, albeit at the price of increased computational cost. It should also be noted that, with the exception of trees and other special cases (McEliece and Yildirim, 2002; Pakzad and Anantharam, 2002), the BVP is usually a nonconvex problem, in that H_Bethe fails to be concave. As a consequence, there may be multiple local optima to the BVP, and there are no guarantees that sum-product (or other iterative algorithms) will find a global optimum.
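For reference, the message-update rule 7.58 can be implemented in a few lines. The following sketch (arbitrary parameters) runs the updates on a three-node chain, where the graph is a tree and the fixed-point beliefs therefore recover the exact marginals, here checked by enumeration.

```python
import itertools
import math

# pairwise binary MRF on the chain 0 - 1 - 2; parameter values are arbitrary
theta_s = [0.3, -0.5, 0.2]
theta_st = {(0, 1): 0.8, (1, 2): -0.6}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def edge_theta(s, t):
    return theta_st.get((s, t), theta_st.get((t, s)))

# M[(t, s)] is the message from t to s, a function of x_s; initialized uniformly
M = {(t, s): [1.0, 1.0] for s in nbrs for t in nbrs[s]}
for _ in range(50):
    for (t, s) in list(M):
        new = [sum(math.exp(edge_theta(s, t) * a * b + theta_s[t] * b) *
                   math.prod(M[(u, t)][b] for u in nbrs[t] if u != s)
                   for b in (0, 1)) for a in (0, 1)]
        z = sum(new)                 # the normalization constant kappa
        M[(t, s)] = [v / z for v in new]

def belief(s):
    # tau_s(x_s) proportional to exp(theta_s(x_s)) * product of incoming messages
    b = [math.exp(theta_s[s] * a) * math.prod(M[(t, s)][a] for t in nbrs[s])
         for a in (0, 1)]
    z = sum(b)
    return [v / z for v in b]

# brute-force marginals for comparison (sum-product is exact on a tree)
def score(x):
    return math.exp(sum(theta_s[s] * x[s] for s in range(3)) +
                    sum(w * x[s] * x[t] for (s, t), w in theta_st.items()))

configs = list(itertools.product([0, 1], repeat=3))
Z = sum(score(x) for x in configs)
marg = [[sum(score(x) for x in configs if x[s] == a) / Z for a in (0, 1)]
        for s in range(3)]
print(belief(0), marg[0])  # these agree
```

On a graph with cycles, the same updates can be run unchanged; the resulting beliefs are then only pseudomarginals, in keeping with the discussion above.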


As illustrated in fig. 7.10c, the constraint set LOCAL(G) of the Bethe variational problem is a strict outer bound on the marginal polytope MARG(G). Since the exact marginals of p(x; θ) must always lie in the marginal polytope, a natural question is whether solutions to the Bethe variational problem ever fall into the region LOCAL(G) \ MARG(G). There turns out to be a straightforward answer to this question, stemming from an alternative reparameterization-based characterization of sum-product fixed points (Wainwright et al., 2003b). One consequence of this characterization is that for any vector τ of pseudomarginals in the interior of LOCAL(G), it is possible to specify a distribution for which τ is a sum-product fixed point. As a particular example, it is possible to construct a distribution p(x; θ) such that the pseudomarginal τ discussed in example 7.14 is a fixed point of the sum-product updates.

7.6.5 Extensions Based on Clustering and Hypertrees

From our development in the previous section, it is clear that there are two distinct components to the Bethe variational principle: (1) the entropy approximation H_Bethe, and (2) the approximation LOCAL(G) to the set of realizable marginal parameters. In principle, the BVP could be strengthened by improving either one, or both, of these components. One natural generalization of the BVP, first proposed by Yedidia et al. (2002) and further explored by various researchers (Heskes et al.; McEliece and Yildirim, 2002; Minka, 2001), is based on working with clusters of variables. The approximations in the Bethe approach are based on trees, which are special cases of junction trees based on cliques of size two. A natural strategy, then, is to strengthen the approximations by exploiting more complex junction trees, also known as hypertrees. Our description of this procedure is very brief, but further details can be found in various sources (Wainwright and Jordan, 2003b; Yedidia et al., 2002).

Recall that the essential ingredients in the Bethe variational principle are local (pseudo)marginal distributions on nodes and edges (i.e., pairs of nodes). These distributions, subject to edgewise marginalization conditions, are used to specify the Bethe entropy approximation. One way to improve the Bethe approach, which is based on pairs of nodes, is to build entropy approximations and impose marginalization constraints on larger clusters of nodes. To illustrate, suppose that the original graph is simply the 3 × 3 grid shown in fig. 7.11a. A particular grouping of the nodes, known as Kikuchi four-plaque clustering in statistical physics (Yedidia et al., 2002), is illustrated in fig. 7.11b. This operation creates four new "supernodes" or clusters, each consisting of four nodes from the original graph. These clusters, as well as their overlaps—which turn out to be critical to track for certain technical reasons (Yedidia et al., 2002)—are illustrated in fig. 7.11c.
Given a clustering of this type, we now consider a set of marginal distributions τ_h, where h ranges over the clusters. As with the singleton τ_s and pairwise τ_st that define the Bethe approximation, we require that these higher-order cluster marginals be suitably normalized (i.e., $\sum_{x_h} \tau_h(x_h) = 1$), and be consistent with one another whenever they overlap. More precisely, for any pair of clusters g ⊆ h, the marginalization condition

$$\sum_{\{x'_h \,\mid\, x'_g = x_g\}} \tau_h(x'_h) = \tau_g(x_g)$$

must hold. Imposing these normalization and marginalization conditions leads to a higher-order analog of the constraint set LOCAL(G) previously defined in equation 7.54. In analogy to the Bethe entropy approximation, we can also consider a hypertree-based approximation to the entropy. There are certain technical aspects to specifying such entropy approximations, in that it turns out to be critical to weight the local entropies with certain "overcounting" numbers (Wainwright and Jordan, 2003b; Yedidia et al., 2002). Without going into these details here, the outcome is another relaxed variational principle, which can be understood as a higher-level analog of the Bethe variational principle.

Figure 7.11  (a) Ordinary 3 × 3 grid. (b) Clustering of the vertices into groups of four, known as Kikuchi four-plaque clustering. (c) Poset diagram of the clusters as well as their overlaps. Pseudomarginals on these subsets must satisfy certain local consistency conditions, and are used to define a higher-order entropy approximation.
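The overlap-consistency condition can be illustrated directly: cluster marginals obtained from any single joint distribution automatically agree on their overlaps, which is exactly the property the constraints impose on candidate pseudomarginals. The following sketch checks this for two of the four-plaque clusters of the 3 × 3 grid (an arbitrary positive joint distribution is used).

```python
import itertools
import random

random.seed(1)
# an arbitrary positive joint distribution over the grid's nine binary variables
configs = list(itertools.product([0, 1], repeat=9))
w = {x: random.random() + 0.1 for x in configs}
Z = sum(w.values())
p = {x: v / Z for x, v in w.items()}

def marginal(sub):
    # marginalize p onto the index set sub
    out = {}
    for x, pr in p.items():
        key = tuple(x[i] for i in sub)
        out[key] = out.get(key, 0.0) + pr
    return out

# two of the four-plaque clusters (nodes {1,2,4,5} and {2,3,5,6}, here 0-indexed)
tau_h1 = marginal((0, 1, 3, 4))
tau_h2 = marginal((1, 2, 4, 5))

def collapse(tau, pos):
    # marginalize a cluster marginal onto the overlap positions
    out = {}
    for key, pr in tau.items():
        k = tuple(key[i] for i in pos)
        out[k] = out.get(k, 0.0) + pr
    return out

# the overlap g = {2, 5} (0-indexed {1, 4}); both clusters must induce the same tau_g
g_from_h1 = collapse(tau_h1, (1, 3))
g_from_h2 = collapse(tau_h2, (0, 2))
print(g_from_h1, g_from_h2)  # identical up to floating-point error
```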

7.7 From the Exact Principle to New Approximations

The preceding sections have illustrated how a variety of known methods—both exact and approximate—can be understood in a unified manner on the basis of the general variational principle given in equation 7.37. In this final section, we turn to a brief discussion of several new approximate methods that also emerge from this same variational principle. Given space constraints, our discussion in this chapter is necessarily brief, but we refer the reader to the papers of Wainwright and Jordan (2003a,b) and Wainwright et al. (2002, 2003a) for further details.

7.7.1 Exploiting Semidefinite Constraints for Approximate Inference

As discussed in section 7.5, one key component in any relaxation of the exact variational principle is an approximation of the set M of realizable mean parameters. Recall that for graphical models that involve discrete random variables, we refer to


this set as a marginal polytope. Since any polytope is specified by a finite collection of halfspace constraints (see fig. 7.6), one very natural way in which to generate an outer approximation is to include only a subset of these halfspace constraints. Indeed, as we have seen in section 7.6, it is precisely this route that the Bethe approximation and its clustering-based extensions follow. However, such polyhedral relaxations are not the only way in which to generate outer approximations to marginal polytopes. Recognizing that elements of the marginal polytope are essentially moments leads very naturally to the idea of a semidefinite relaxation. Indeed, the use of semidefinite constraints for characterizing moments has a very rich history, both in classical work (Karlin and Studden, 1966) on scalar random variables, and in more recent work (Lasserre, 2001; Parrilo, 2003) on the multivariate case.

Semidefinite Outer Bounds on Marginal Polytopes

We use the case of a multinomial MRF defined by a graph G = (V, E), as discussed in example 7.4, in order to illustrate the use of semidefinite constraints. Although the basic idea is quite generally applicable (Wainwright and Jordan, 2003b), herein we restrict ourselves to binary variables (i.e., $\mathcal{X}_s = \{0, 1\}$) so as to simplify the exposition. Recall that the sufficient statistics in a binary MRF take the form of certain indicator functions, as defined in equation 7.20. In fact, this representation is overcomplete (in that there are linear dependencies among the indicator functions); in the binary case, it suffices to consider only the sufficient statistics $x_s = \mathbb{I}_1(x_s)$ and $x_s x_t = \mathbb{I}_{11}(x_s, x_t)$. Our goal, then, is to characterize the set of all first- and second-order moments, defined by $\mu_s = E[x_s]$ and $\mu_{st} = E[x_s x_t]$ respectively, that arise from taking expectations with respect to a distribution with its support restricted to $\{0, 1\}^n$.
Rather than focusing on just the pairs μ_st for edges (s, t) ∈ E, it is convenient to consider the full collection of pairwise moments {μ_st | s, t ∈ V}. Suppose that we are given a vector $\mu \in \mathbb{R}^d$ (where $d = n + \binom{n}{2}$), and wish to assess whether or not it is a globally realizable moment vector (i.e., whether there exists some distribution p(x) such that $\mu_s = \sum_x p(x)\, x_s$ and $\mu_{st} = \sum_x p(x)\, x_s x_t$). In order to derive a necessary condition, we suppose that such a distribution p exists, and then consider the following (n + 1) × (n + 1) moment matrix:

$$E_p\!\left[ \begin{bmatrix} 1 \\ x \end{bmatrix} \begin{bmatrix} 1 & x^T \end{bmatrix} \right] = \begin{bmatrix} 1 & \mu_1 & \mu_2 & \cdots & \mu_n \\ \mu_1 & \mu_1 & \mu_{12} & \cdots & \mu_{1n} \\ \mu_2 & \mu_{21} & \mu_2 & \cdots & \mu_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mu_n & \mu_{n1} & \mu_{n2} & \cdots & \mu_n \end{bmatrix}, \qquad (7.59)$$

which we denote by $M_1[\mu]$. Note that in calculating the form of this moment matrix, we have made use of the relation $\mu_s = \mu_{ss}$, which holds because $x_s = x_s^2$ for any binary-valued quantity.


We now observe that any such moment matrix is necessarily positive semidefinite, which we denote by $M_1[\mu] \succeq 0$. (This positive semidefiniteness can be verified as follows: letting y := (1, x), then for any vector $a \in \mathbb{R}^{n+1}$, we have $a^T M_1[\mu]\, a = a^T E[y y^T]\, a = E[(a^T y)^2]$, which is certainly nonnegative.) Therefore, we conclude that the semidefinite constraint set

$$\mathrm{SDEF}_1 := \{ \mu \in \mathbb{R}^d \mid M_1[\mu] \succeq 0 \}$$

is an outer bound on the exact marginal polytope.

Example 7.15
To illustrate the use of the outer bound SDEF1, recall the pseudomarginal vector τ that we constructed in example 7.14 for the single cycle on three nodes. In terms of our reduced representation (involving only expectations of the singletons x_s and pairwise functions x_s x_t), this pseudomarginal can be written as follows:

$$\tau_s = 0.5 \ \text{ for } s = 1, 2, 3, \qquad \tau_{12} = \tau_{23} = 0.4, \qquad \tau_{13} = 0.1.$$

Suppose that we now construct the matrix M_1 for this trial set of mean parameters; it takes the following form:

$$M_1[\tau] = \begin{bmatrix} 1 & 0.5 & 0.5 & 0.5 \\ 0.5 & 0.5 & 0.4 & 0.1 \\ 0.5 & 0.4 & 0.5 & 0.4 \\ 0.5 & 0.1 & 0.4 & 0.5 \end{bmatrix}.$$

A simple calculation shows that it is not positive semidefinite, so that τ ∉ SDEF1. Since SDEF1 is an outer bound on the marginal polytope, this reasoning shows—in a very quick and direct manner—that τ is not a globally valid moment vector. In fact, the semidefinite constraint set SDEF1 can be viewed as the first in a sequence of progressively tighter relaxations of the marginal polytope.

Log-Determinant Relaxation

We now show how to use such semidefinite constraints in approximate inference. Our approach is based on combining the first-order semidefinite outer bound SDEF1 with a Gaussian-based entropy approximation. The end result is a log-determinant problem that represents another relaxation of the exact variational principle (Wainwright and Jordan, 2003a). In contrast to the Bethe and Kikuchi approaches, this relaxation is convex (and hence has a unique optimum), and moreover it provides an upper bound on the cumulant generating function.

Our starting point is the familiar interpretation of the Gaussian as the maximum entropy distribution subject to covariance constraints (Cover and Thomas, 1991). In particular, given a continuous random vector $\tilde{x}$, its differential entropy $h(\tilde{x})$ is always upper bounded by the entropy of a Gaussian with matched covariance, or in analytical terms

$$h(\tilde{x}) \le \frac{1}{2} \log \det \mathrm{cov}(\tilde{x}) + \frac{n}{2} \log(2\pi e), \qquad (7.60)$$

where $\mathrm{cov}(\tilde{x})$ is the covariance matrix of $\tilde{x}$. The upper bound 7.60 is not directly


applicable to a random vector taking values in a discrete space (since differential entropy in this case diverges to minus infinity). However, a straightforward discretization argument shows that for any discrete random vector $x \in \{0, 1\}^n$, its (ordinary) discrete entropy can be upper bounded in terms of the matrix $M_1[\mu]$ of mean parameters as

$$H(x) = -A^*(\mu) \le \frac{1}{2} \log \det\!\Big( M_1[\mu] + \frac{1}{12}\, \mathrm{blkdiag}[0, I_n] \Big) + \frac{n}{2} \log(2\pi e), \qquad (7.61)$$

where blkdiag[0, I_n] is an (n + 1) × (n + 1) block-diagonal matrix with a 1 × 1 zero block and an n × n identity block. Finally, putting all the pieces together leads to the following result (Wainwright and Jordan, 2003a): the cumulant generating function A(θ) is upper bounded by the solution of the following log-determinant optimization problem:

$$A(\theta) \le \max_{\tau \in \mathrm{SDEF}_1} \Big\{ \langle \theta, \tau \rangle + \frac{1}{2} \log \det\!\Big( M_1(\tau) + \frac{1}{12}\, \mathrm{blkdiag}[0, I_n] \Big) \Big\} + \frac{n}{2} \log(2\pi e). \qquad (7.62)$$

Note that the constraint τ ∈ SDEF1 ensures that $M_1(\tau) \succeq 0$, and hence a fortiori that $M_1(\tau) + \frac{1}{12}\,\mathrm{blkdiag}[0, I_n]$ is positive definite. Moreover, an important fact is that the optimization problem in equation 7.62 is a determinant maximization problem, for which efficient interior point methods have been developed (Vandenberghe et al., 1998).

Just as the Bethe variational principle (eq. 7.56) is a tree-based approximation, the log-determinant relaxation (eq. 7.62) is a Gaussian-based approximation. In particular, it is worthwhile comparing the structure of the log-determinant relaxation (eq. 7.62) to the exact variational principle for a multivariate Gaussian, as described in section 7.4.1. In contrast to the Bethe variational principle, in which all of the constraints defining the relaxation are local, this new principle (eq. 7.62) imposes some quite global constraints on the mean parameters.
Empirically, these global constraints are important for strongly coupled problems, on which the log-determinant relaxation appears much more robust than the sum-product algorithm (Wainwright and Jordan, 2003a). In summary, starting from the exact variational principle (eq. 7.37), we have derived a new relaxation whose properties are rather different from those of the Bethe and Kikuchi variational principles.
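Although solving the full log-determinant relaxation requires a semidefinite programming solver, the entropy bound of equation 7.61 on which it rests can be checked directly. The following sketch (an arbitrary random distribution on {0, 1}^3) evaluates both sides of the bound, computing the log-determinant via a small hand-rolled Cholesky factorization.

```python
import itertools
import math
import random

random.seed(0)
n = 3
configs = list(itertools.product([0, 1], repeat=n))
w = [random.random() + 0.05 for _ in configs]
p = [v / sum(w) for v in w]

# discrete entropy and the augmented moment matrix M1[mu] + blkdiag[0, I_n]/12
H = -sum(pr * math.log(pr) for pr in p)
M = [[sum(pr * (1, *x)[i] * (1, *x)[j] for x, pr in zip(configs, p))
      for j in range(n + 1)] for i in range(n + 1)]
for i in range(1, n + 1):
    M[i][i] += 1.0 / 12.0

def logdet(A):
    # log-determinant via a small Cholesky factorization (A must be PD)
    m = len(A)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(m))

bound = 0.5 * logdet(M) + 0.5 * n * math.log(2 * math.pi * math.e)
print(H, bound)  # H never exceeds the bound
```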

7.7.2 Relaxations for Computing Modes

Recall from our introductory comments in section 7.1.2 that, in addition to the problem of computing expectations and likelihoods, it is also frequently of interest to compute the mode of a distribution. This section is devoted to a brief discussion of mode computation, and more concretely how the exact variational principle (eq. 7.37), as well as relaxations thereof, again turns out to play an important role.


Zero-Temperature Limits

In order to understand the role of the exact variational principle 7.37 in computing modes, consider a multinomial MRF of the form p(x; θ), as discussed in example 7.4. Of interest to us is the one-parameter family of distributions {p(x; βθ) | β > 0}, where β is a real parameter to be varied. At one extreme, if β = 0, then there is no coupling, and the distribution is simply uniform over all possible configurations. The other extreme, as β → +∞, is more interesting; in this limit, the distribution concentrates all of its mass on the configuration (or subset of configurations) that are modes of the distribution. Taking the limit β → +∞ is known as the zero-temperature limit, since the parameter β is typically viewed as an inverse temperature in statistical physics.

This argument suggests that there should be a link between computing modes and the limiting behavior of the marginalization problem as β → +∞. In order to develop this idea a bit more formally, we begin by observing that the exact variational principle 7.37 holds for the distribution p(x; βθ) for any value of β ≥ 0. It can be shown (Wainwright and Jordan, 2003b) that if we take a suitably scaled limit of this exact variational principle as β → +∞, then we recover the following variational principle for computing modes:

$$\max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{\mu \in \mathrm{MARG}(G)} \langle \theta, \mu \rangle. \qquad (7.63)$$
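The zero-temperature limit behind this principle can be observed numerically: the scaled cumulant generating function (1/β)A(βθ) decreases toward the value of the mode as β grows. The following sketch (arbitrary parameters on a three-node cycle) computes the gap for increasing β.

```python
import itertools
import math

# arbitrary binary MRF on a three-node cycle
theta_s = [0.4, -0.3, 0.1]
theta_st = {(0, 1): 1.0, (1, 2): -0.7, (0, 2): 0.5}

def score(x):
    # <theta, phi(x)>
    return sum(theta_s[s] * x[s] for s in range(3)) + \
           sum(w * x[s] * x[t] for (s, t), w in theta_st.items())

configs = list(itertools.product([0, 1], repeat=3))
mode_value = max(score(x) for x in configs)   # left-hand side of (7.63)

def scaled_log_partition(beta):
    # (1 / beta) * A(beta * theta), computed by enumeration
    return math.log(sum(math.exp(beta * score(x)) for x in configs)) / beta

gaps = [scaled_log_partition(b) - mode_value for b in (1.0, 10.0, 100.0)]
print(gaps)  # positive, and shrinking toward zero as beta grows
```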

Since the log probability log p(x; θ) is equal to ⟨θ, φ(x)⟩ (up to an additive constant), the left-hand side is simply the problem of computing the mode of the distribution p(x; θ). On the right-hand side, we simply have a linear program, since the constraint set MARG(G) is a polytope, and the cost function ⟨θ, μ⟩ is linear in μ (with θ fixed). This equivalence means that, at least in principle, we can compute a mode of the distribution by solving a linear program (LP) over the marginal polytope. The geometric interpretation is also clear: as illustrated in fig. 7.6, vertices of the marginal polytope are in one-to-one correspondence with configurations x. Since any LP achieves its optimum at a vertex (Bertsimas and Tsitsiklis, 1997), solving the LP is equivalent to finding the mode.

Linear Programming and Tree-Reweighted Max-Product

Of course, the LP-based reformulation in equation 7.63 is not practically useful, for precisely the same reasons as before—it is extremely challenging to characterize the marginal polytope MARG(G) for a general graph. Many computationally intractable optimization problems (e.g., MAX-CUT) can be reformulated as LPs over the marginal polytope, as in equation 7.63, which underscores the inherent complexity of characterizing marginal polytopes. Nonetheless, this variational formulation motivates the idea of forming relaxations using outer bounds on the marginal polytope. For various classes of problems in combinatorial optimization, both linear programming and semidefinite relaxations of this flavor have been studied extensively. Here we briefly describe an LP relaxation that is very natural given our development of the Bethe variational principle in section 7.6. In particular, we consider using the local constraint set LOCAL(G), as defined in equation 7.54, as


an outer bound on the marginal polytope MARG(G). Doing so leads to the following LP relaxation for the problem of computing the mode of a multinomial MRF:

$$\max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{\mu \in \mathrm{MARG}(G)} \langle \theta, \mu \rangle \ \le \ \max_{\tau \in \mathrm{LOCAL}(G)} \langle \theta, \tau \rangle. \qquad (7.64)$$

Since the relaxed constraint set LOCAL(G)—like the original set MARG(G)—is a polytope, the relaxation on the right-hand side of equation 7.64 is a linear program. Consequently, the optimum of the relaxed problem must be attained at a vertex (possibly more than one) of the polytope LOCAL(G).

Figure 7.12  The constraint set LOCAL(G) is an outer bound on the exact marginal polytope. Its vertex set includes all the vertices of MARG(G), which are in one-to-one correspondence with optimal solutions of the integer program. It also includes additional fractional vertices, which are not vertices of MARG(G).

We say that a vertex of LOCAL(G) is integral if all of its components are zero or one, and fractional otherwise. The distinction between fractional and integral vertices is crucial, because it determines whether or not the LP relaxation 7.64 specified by LOCAL(G) is tight. In particular, there are only two possible outcomes to solving the relaxation:

1. The optimum is attained at a vertex of MARG(G), in which case the upper bound in equation 7.64 is tight, and a mode can be obtained.

2. The optimum is attained only at one or more fractional vertices of LOCAL(G), which lie strictly outside MARG(G). In this case, the upper bound of equation 7.64 is loose, and the relaxation does not output the optimal configuration.

Figure 7.12 illustrates both of these possibilities. The vector θ1 corresponds to case 1, in which the optimum is attained at a vertex of MARG(G). The vector θ2 represents a less fortunate setting, in which the optimum is attained only at a fractional vertex of LOCAL(G). In simple cases, one can explicitly demonstrate a fractional vertex of the polytope LOCAL(G).

Given the link between the sum-product algorithm and the Bethe variational principle, it would be natural to conjecture that the max-product algorithm can be derived as an algorithm for solving the LP relaxation 7.64. For trees (in which

200

A Variational Principle for Graphical Models

case the LP 7.64 is exact), this conjecture is true: more precisely, it can be shown (Wainwright et al., 2003a) that the max-product algorithm (or the Viterbi algorithm) is an iterative method for solving the dual problem of the LP 7.64. However, this statement is false for graphs with cycles, since it is straightforward to construct problems (on graphs with cycles) for which the max-product algorithm will output a nonoptimal conﬁguration. Consequently, the max-product algorithm does not specify solutions to the dual problem, since any LP relaxation will output either a conﬁguration with a guarantee of correctness, or a fractional vertex. However, Wainwright et al. (2003a) derive a tree-reweighted analog of the maxproduct algorithm, which does have provable connections to dual optimal solutions of the tree-based relaxation 7.64.
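To make the relaxation concrete, the following sketch (assuming SciPy is available) solves the LP over LOCAL(G) for a single-edge, hence tree-structured, binary model. The potentials in `theta` are arbitrary illustrative values, not taken from the text; since the graph is a tree, LOCAL(G) = MARG(G) here and the optimum is attained at an integral vertex.

```python
import numpy as np
from scipy.optimize import linprog

# Pseudomarginal vector tau = [t1(0), t1(1), t2(0), t2(1),
#                              t12(00), t12(01), t12(10), t12(11)]
theta = np.array([0.0, 1.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.5])  # illustrative potentials

# LOCAL(G): normalization of the single-node pseudomarginals and
# marginalization consistency of the edge pseudomarginal.
A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # t1 sums to 1
    [0, 0, 1, 1, 0, 0, 0, 0],    # t2 sums to 1
    [-1, 0, 0, 0, 1, 1, 0, 0],   # sum_{x2} t12(0, x2) = t1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],   # sum_{x2} t12(1, x2) = t1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],   # sum_{x1} t12(x1, 0) = t2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],   # sum_{x1} t12(x1, 1) = t2(1)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

# linprog minimizes, so negate theta to maximize <theta, tau>.
res = linprog(-theta, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
lp_value = -res.fun

# Brute force over the four integral configurations (vertices of MARG(G)).
def score(x1, x2):
    return theta[x1] + theta[2 + x2] + theta[4 + 2 * x1 + x2]

best = max(score(x1, x2) for x1 in (0, 1) for x2 in (0, 1))
```

For this tree the LP value coincides with the integer optimum; on a graph with cycles, the same construction can instead terminate at a fractional vertex, making the upper bound loose.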

7.8  Conclusion

A fundamental problem that arises in applications of graphical models, whether in signal processing, machine learning, bioinformatics, communication theory, or other fields, is that of computing likelihoods, marginal probabilities, and other expectations. We have presented a variational characterization of the problem of computing likelihoods and expectations in general exponential-family graphical models. Our characterization focuses attention on both the constraint set and the objective function. In particular, for exponential-family graphical models, the constraint set M is a convex subset of a finite-dimensional space, consisting of all realizable mean parameters. The objective function is the sum of a linear function and an entropy function. The latter is a concave function, and thus the overall problem, that of maximizing the objective function over M, is a convex problem.

In this chapter, we discussed how the junction tree algorithm and other exact inference algorithms can be understood as particular methods for solving this convex optimization problem. In addition, we showed that a variety of approximate inference algorithms, including loopy belief propagation, general cluster variational methods, and mean field methods, can be understood as methods for solving particular relaxations of the general variational principle. More concretely, we saw that belief propagation involves an outer approximation of M, whereas mean field methods involve an inner approximation of M. In addition, this variational principle suggests a number of new inference algorithms, as we briefly discussed.

It is worth noting certain limitations inherent to the variational framework as presented in this chapter. In particular, we have not discussed curved exponential families, but instead limited our treatment to regular families. Curved exponential families are useful in the context of directed graphical models, and further research is required to develop a general variational treatment of such models. Similarly, we have dealt exclusively with exponential family models, and not treated nonparametric models. One approach to exploiting variational ideas for nonparametric models


is through exponential family approximations of nonparametric distributions; for example, Blei and Jordan (2004) have presented inference methods for Dirichlet process mixtures that are based on the variational framework presented here.

Notes

1. The Gaussian case is an important exception to this statement.
2. That a graph is triangulated means that every cycle of length four or longer has a chord.
3. Some care is required in dealing with the boundary conditions τ_s(x_s) ≥ 0 and τ_st(x_s, x_t) ≥ 0; see Yedidia et al. (2001) for further discussion.

8  Modeling Large Dynamical Systems with Dynamical Consistent Neural Networks

Hans-Georg Zimmermann, Ralph Grothmann, Anton Maximilian Schäfer, and Christoph Tietz

Recurrent neural networks are typically considered to be relatively simple architectures which come along with complicated learning algorithms. Most researchers focus on improving these algorithms. Our approach is different: rather than focusing on learning and optimization algorithms, we concentrate on the network architecture. Unfolding in time is a well-known example of this modeling philosophy. Here, a temporal algorithm is transferred into an architectural framework such that the learning can be done with an extension of standard error backpropagation.

As we will show, many difficulties in the modeling of dynamical systems can be solved with neural network architectures. We exemplify architectural solutions for the modeling of open systems and the problem of unknown external influences. Another research area is the modeling of high-dimensional systems with large neural networks. Instead of modeling, e.g., a financial market as a small set of time series, we try to integrate the information from several markets into one model. Standard neural networks tend to overfit, like other statistical learning systems. We will introduce a new recurrent neural network architecture in which overfitting, and the associated loss of generalization ability, is not a major problem. In this context we will point to different sources of uncertainty which have to be handled when dealing with recurrent neural networks. Furthermore, we will show that sparseness of the network's transition matrix is not only important to dampen overfitting but also provides new features such as an optimal memory design.

8.1  Introduction

Recurrent neural networks (RNNs) allow the identification of dynamical systems in the form of high-dimensional, nonlinear state space models. They offer an explicit modeling of time and memory and allow us, in principle, to model any type of dynamical system (Elman, 1990; Haykin, 1994; Kolen and Kremer, 2001; Medsker and Jain, 1999). The basic concept is as old as the theory of artificial neural networks: unfolding in time of neural networks and related modifications of the backpropagation algorithm can already be found in Werbos (1974) and Rumelhart et al. (1986). Different types of learning algorithms are summarized by Pearlmutter (1995). Nevertheless, over the last fifteen years most time series problems have been approached with feedforward neural networks. The appeal of modeling time and memory in recurrent networks is opposed to the apparently better numerical tractability of a pattern-recognition approach as represented by feedforward neural networks. Still, some researchers have continued to advance the theory of recurrent neural networks. Recent developments are summarized in the books of Haykin (1994), Kolen and Kremer (2001), Soofi and Cao (2002), and Medsker and Jain (1999).

Our approach differs from the outlined research directions in a significant but, at first sight, nonobvious way. Instead of focusing on algorithms, we put network architectures in the foreground. We show that a network architecture automatically implies the use of an adjoint solution algorithm for the parameter identification problem. This correspondence between architecture and equations holds for simple as well as complex network architectures. The underlying assumption is that the associated parameter optimization problem is solved by error backpropagation through time, i.e., a shared-weights extension of the standard error backpropagation algorithm.

In technical and economic applications virtually all systems of interest are open dynamical systems (see section 8.2). This means that the dynamics of the system is determined partly by an autonomous development and partly by external drivers from the system environment. The measured data always reflect a superposition of both parts.
If we are interested in forecasting the development of the system, extracting the autonomous subsystem is the most relevant task. It is the only part of the open system that can be predicted (see subsec. 8.2.3). A related question is the sequence length of the unfolding in time which is necessary to approximate the recurrent system (see subsec. 8.2.2).

The outlined concepts are only applicable if we have a perfectly specified open dynamical system, in which all external drivers are known. Unfortunately, this assumption is virtually never fulfilled in real-world applications. Even if we knew all the external system drivers, it would be questionable whether an appropriate amount of training data would be available. As a consequence, the task of identifying the open system is misspecified right from the beginning. To address this problem, we introduce error correction neural networks (ECNN) (Zimmermann et al., 2002b) (see section 8.2.4).

Another weakness of our modeling framework is the implicit assumption that we only have to analyze a small number of time series. This is also uncommon in real-world applications. For instance, in economics we face coherent markets and not a single interest or foreign exchange rate. A market or a complex technical plant is intrinsically high-dimensional. Now the major problem is that all our neural networks tend to overfit if we increase the model dimensionality in order to approach the true high-dimensional system dynamics. We therefore present recurrent network architectures which work even for very large state spaces (see section 8.3). These networks also combine different operations of small neural networks (e.g., processing of input information) into one shared state transition matrix. Our experiments indicate that this stabilizes the model behavior to a large extent (see section 8.3.1).

If one iterates an open system into the future, the standard assumption is that the system environment remains constant. As this is not true for most real-world applications, we introduce dynamical consistent recurrent neural networks, which also try to forecast the external influences (see section 8.3.2). We then combine the concepts of large networks and dynamical consistency with error correction. We show that ECNNs can be extended in a slightly different way than basic recurrent networks (see section 8.3.3). We also demonstrate that some types of dynamical systems can more easily be analyzed with a (dynamical consistent) recurrent network, while others are more appropriate for ECNNs. Our intention is to merge the different aspects of the competing network architectures within a single recurrent neural network. We call it DCNN, for dynamical consistent neural network. We found that DCNNs allow us to model even small deviations in the dynamics without losing the generalization abilities of the model. We point out that the networks presented so far create state trajectories of the dynamics that are close to the observed ones, whereas the DCNN evolves exactly on the observed trajectories (see section 8.3.4). Finally, we introduce a DCNN architecture for partially known observables to generalize our models from a differentiation between past and future to the (time-independent) availability of information (see section 8.3.5).

The identification and forecasting of dynamical systems has to cope with a number of uncertainties in the underlying data as well as in the development of the dynamics (see section 8.4).
Cleaning noise is a technique which allows the model itself, within the training process, to correct corrupted or noisy data (see section 8.4.1). Working with finite unfolding in time brings up the problem of initializing the internal state at the first time step. We present different approaches to desensitize the model's behavior to the initial state and simultaneously improve the generalization abilities (see section 8.4.2). To stabilize the network against uncertainties in the environment's future development, we further apply noise to the inputs in the future part of the network (see section 8.4.3).

Working with (high-dimensional) recurrent networks raises the question of how a desired network function can be supported by a certain structure of the transition matrix (see section 8.5). As we will point out, sparseness alone is not sufficient to optimize the network functions regarding conservation and superposition of information. Only with an inflation of the internal dimension of the recurrent neural network can we implement an optimal balance between memory and computation effects (see sections 8.5.1 and 8.5.2). In this context we work out that sparseness of the transition matrix is actually a necessary condition for large neural networks (see section 8.5.3). Furthermore, we analyze the information flow in sparse networks and present an architectural solution which speeds up the distribution of information (see section 8.5.4).


Finally, section 8.6 summarizes our contributions to the research ﬁeld of recurrent neural networks.

8.2  Recurrent Neural Networks (RNN)

Figure 8.1 illustrates a dynamical system (Zimmermann and Neuneier, 2001, p. 321).

Figure 8.1  Identification of a dynamical system using a discrete time description: input u, hidden states s, and output y.

The dynamical system (ﬁg. 8.1) can be described for discrete time grids as a set of equations (eq. 8.1), consisting of a state transition and an output equation (Haykin, 1994; Kolen, 2001):

s_{t+1} = f(s_t, u_t)    (state transition)
y_t = g(s_t)             (output equation)          (8.1)

The state transition is a mapping from the present internal hidden state of the system s_t and the influence of external inputs u_t to the new state s_{t+1}. The output equation computes the observable output y_t. The system can be viewed as a partially observable autoregressive dynamic state transition s_t → s_{t+1} which is also driven by external forces u_t. Without the external inputs the system is called an autonomous system (Haykin, 1994; Mandic and Chambers, 2001). However, in reality most systems are driven by a superposition of an autonomous development and external influences.

The task of identifying the dynamical system of equation 8.1 can be stated as the problem of finding (parameterized) functions f and g such that a distance measure (eq. 8.2) between the observed data y_t^d and the computed data y_t of the model is minimal:¹

∑_{t=1}^{T} (y_t − y_t^d)²  →  min_{f,g}          (8.2)
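As a minimal illustration of equations 8.1 and 8.2, the sketch below iterates a generic discrete-time open system and accumulates the squared output error. The concrete f and g in the example are arbitrary stand-ins, not a model from the text.

```python
def iterate_system(f, g, s0, inputs):
    """Run the open dynamical system of eq. 8.1: emit y_t = g(s_t),
    then advance the state via s_{t+1} = f(s_t, u_t)."""
    s, outputs = s0, []
    for u in inputs:
        outputs.append(g(s))
        s = f(s, u)
    return outputs, s

def sse(outputs, targets):
    """Distance measure of eq. 8.2: sum of squared errors between
    computed outputs y_t and observed data y_t^d."""
    return sum((y - yd) ** 2 for y, yd in zip(outputs, targets))

# Example: a stable linear system f(s, u) = 0.5 s + u with identity readout.
ys, s_final = iterate_system(lambda s, u: 0.5 * s + u, lambda s: s, 0.0, [1.0, 1.0])
error = sse(ys, [0.0, 0.0])
```

The identification task of eq. 8.2 is then to choose f and g so that `error` is minimized over the observed sequence.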

If we assume that the state transition does not depend on s_t, i.e., y_t = g(s_t) = g(f(u_{t−1})), we are back in the framework of feedforward neural networks (Neuneier and Zimmermann, 1998). However, the inclusion of the internal hidden dynamics makes the modeling task much harder, because it allows varying intertemporal dependencies. Theoretically, in the recurrent framework an event s_{t+1} is explained by a superposition of external inputs u_t, u_{t−1}, . . . from all previous time steps (Haykin, 1994; Mandic and Chambers, 2001).

8.2.1  Representing Dynamic Systems by Recurrent Neural Networks

The identification task of equations 8.1 and 8.2 can be easily modeled by a recurrent neural network (Haykin, 1994; Zimmermann and Neuneier, 2001):

s_{t+1} = tanh(A s_t + c + B u_t)    (state transition)
y_t = C s_t                          (output equation)          (8.3)

where A, B, and C are weight matrices of appropriate dimensions and c is a bias, which handles offsets in the input variables u_t. Note that the output equation y_t = C s_t is implemented as a linear function. It is straightforward to show that this is not a functional restriction by using an augmented inner state vector (Zimmermann and Neuneier, 2001, pp. 322–323). By specifying the functions f and g as a neural network with weight matrices A, B, and C and a bias vector c, we have transformed the system identification task of equation 8.2 into a parameter optimization problem:

∑_{t=1}^{T} (y_t − y_t^d)²  →  min_{A,B,C,c}          (8.4)

As Hornik et al. (1992) proved for feedforward neural networks, it can be shown that recurrent neural networks (eq. 8.3) are universal approximators: they can approximate any dynamical system (eq. 8.1) with a continuous output function g.

8.2.2  Finite Unfolding in Time

In this section we discuss an architectural representation of recurrent neural networks that enables us to solve the parameter optimization problem of equation 8.4 by an extended version of standard backpropagation (Haykin, 1994; Rumelhart et al., 1986).² Figure 8.2 unfolds the network of equation 8.3 (fig. 8.2, left) over time using shared weight matrices A, B, and C (fig. 8.2, right). Shared weights share the same memory for storing their weights, i.e., the weight values are the same at each time step of the unfolding and for every pattern t ∈ {1, . . . , T} (Haykin, 1994; Rumelhart et al., 1986). This guarantees that we have the same dynamics in every time step.

Figure 8.2  Finite unfolding using shared weight matrices A, B, and C.

We approximate the recurrence of the system with a finite unfolding which truncates after a certain number of time steps m ∈ N. The important question is to determine the correct amount of past information needed to predict y_{t+1}. Since the outputs are explained by more and more external information, the error of the outputs decreases with each additional time step from left to right until a minimum error is achieved. This saturation level indicates the maximum number of time steps m which contribute relevant information for modeling the present time state. A more detailed description is given in Zimmermann and Neuneier (2001).

We train the unfolded recurrent neural network shown in fig. 8.2 (right) with error backpropagation through time, which is a shared-weights extension of standard backpropagation (Haykin, 1994; Rumelhart et al., 1986). Error backpropagation is an efficient way of calculating the partial derivatives of the network error function. Thus, all parts of the network are provided with error information.

In contrast to typical feedforward neural networks, RNNs are able to explicitly model memory. This allows the identification of intertemporal dependencies. Furthermore, recurrent networks contain fewer free parameters. In a feedforward neural network an expansion of the delay structure automatically increases the number of weights (left panel of fig. 8.3). In the recurrent formulation, the shared matrices A, B, and C are reused when more delayed input information from the past is needed (right panel of fig. 8.3). Additionally, if weights are shared more often, more gradient information is available for learning. As a consequence, potential overfitting is not as dangerous in recurrent as in feedforward networks. Due to the inclusion of temporal structure in the network architecture, our approach is also applicable to tasks where only a small training set is available (Zimmermann and Neuneier, 2001, p. 325).
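Because the unfolded network reuses the same matrices at every step, the gradient of the loss with respect to each shared weight accumulates error information from all time steps. Backpropagation through time computes these derivatives analytically and efficiently; the brute-force finite-difference check below (illustrative only, using the sum-of-squares loss of equation 8.4) makes the sharing explicit.

```python
import numpy as np

def unfolded_loss(A, B, C, c, s0, inputs, targets):
    """Loss of the unfolded network (eq. 8.4); the same A, B, C, c
    are applied at every time step (shared weights)."""
    s, loss = s0, 0.0
    for u, yd in zip(inputs, targets):
        loss += float(np.sum((C @ s - yd) ** 2))
        s = np.tanh(A @ s + c + B @ u)
    return loss

def fd_grad_A(A, B, C, c, s0, inputs, targets, eps=1e-6):
    """Central finite-difference gradient with respect to the shared
    matrix A. Each entry collects contributions from every unfolded
    step at once, which is why sharing weights more often supplies
    more gradient information for learning."""
    g = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            Ap, Am = A.copy(), A.copy()
            Ap[i, j] += eps
            Am[i, j] -= eps
            g[i, j] = (unfolded_loss(Ap, B, C, c, s0, inputs, targets)
                       - unfolded_loss(Am, B, C, c, s0, inputs, targets)) / (2 * eps)
    return g

# Illustrative data: 3 hidden states, 2 inputs, 1 output, 6 time steps.
rng = np.random.default_rng(1)
n, k, p, T = 3, 2, 1, 6
A = 0.5 * rng.standard_normal((n, n))
B, C, c = rng.standard_normal((n, k)), rng.standard_normal((p, n)), rng.standard_normal(n)
s0 = np.zeros(n)
inputs = [rng.standard_normal(k) for _ in range(T)]
targets = [rng.standard_normal(p) for _ in range(T)]

g = fd_grad_A(A, B, C, c, s0, inputs, targets)
before = unfolded_loss(A, B, C, c, s0, inputs, targets)
after = unfolded_loss(A - 1e-3 * g, B, C, c, s0, inputs, targets)
```

A small step A − ηg decreases the loss, which is the basic update that backpropagation through time performs without the O(#weights) extra function evaluations used here.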


Figure 8.3  An additional time step in the feedforward framework (left), with y_{t+1} = V tanh(W u + c), leads to a higher dimension of the input vector u, whereas the number of free parameters remains constant in recurrent networks (right), due to the use of shared weights.

8.2.3  Overshooting

An obvious generalization of the network in fig. 8.2 is the extension of the autonomous recurrence (matrix A) into the future direction t + 2, t + 3, . . . (see fig. 8.4) (Zimmermann and Neuneier, 2001, pp. 326–327). If this so-called overshooting leads to good predictions, we get a whole sequence of forecasts as output, which is especially interesting for decision support systems. The number of autonomous iterations into the future, which we denote by n ∈ N, most often depends on the required forecast horizon of the application. Note that overshooting does not add new parameters, since the shared weight matrices A and C are reused.
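A sketch of the overshooting part of fig. 8.4: starting from the present state, the autonomous dynamics is iterated n further steps, reusing the shared matrices A and C and introducing no new parameters. The matrix shapes are illustrative.

```python
import numpy as np

def overshoot(A, C, c, s_t, n):
    """Iterate the autonomous transition s = tanh(A s + c) for n steps
    beyond the present state s_t and emit the forecasts y_{t+1..t+n} = C s.
    No external input term B u appears in the future part."""
    s, forecasts = s_t, []
    for _ in range(n):
        s = np.tanh(A @ s + c)
        forecasts.append(C @ s)
    return forecasts

# Illustrative use: 3 hidden states, 1 output, 4 forecast steps.
rng = np.random.default_rng(2)
A, C, c = 0.5 * rng.standard_normal((3, 3)), rng.standard_normal((1, 3)), rng.standard_normal(3)
forecasts = overshoot(A, C, c, np.zeros(3), 4)
```

During training, supplying targets for these additional outputs is what forces the network to build a genuinely autonomous internal dynamics rather than leaning on the most recent inputs.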

Figure 8.4  Overshooting extends the autonomous part of the dynamics.

The most important property of the overshooting network (fig. 8.4) is the concatenation of an input-driven system and an autonomous system. One may argue that the unfolding-in-time network (fig. 8.2) already consists of recurrent functions, and that this recurrent structure has the same modeling characteristics as the overshooting network. This is definitely not true, because the learning algorithm leads to different models for each of the architectures. Backpropagation learning usually tries to model the relationship between the most recent inputs and the output, because the fastest adaptation takes place along the shortest path between input and output. Thus, learning mainly focuses on u_t. Only later in the training process may learning also extract useful information from input vectors u_τ (t − m ≤ τ < t) which are more distant from the output. As a consequence, the unfolding-in-time network (fig. 8.2, right) tries to rely as much as possible on the part of the dynamics which is driven by the most recent inputs u_t, . . . , u_{t−k} with k < m. In contrast, the overshooting network (fig. 8.4) forces the learning, through the additional future outputs y_{t+2}, . . . , y_{t+n}, to focus on modeling an internal autonomous dynamics (Zimmermann and Neuneier, 2001). In summary, overshooting generates additional valuable forecast information about the analyzed dynamical system and stabilizes learning.

8.2.4  Error Correction Neural Networks (ECNN)

If we have a complete description of all external influences, recurrent neural networks (eq. 8.3) allow us to identify the intertemporal relationships (Haykin, 1994). Unfortunately, our knowledge about the external forces is typically incomplete and our observations might be noisy. Under such conditions, learning with finite data sets leads to the construction of incorrect causalities due to learning by heart (overfitting). The generalization properties of such a model are questionable (Neuneier and Zimmermann, 1998).

If we are unable to identify the underlying system dynamics due to insufficient input information or unknown influences, we can refer to the actual model error y_t − y_t^d, which can be interpreted as an indicator that our model is misleading. Handling this error information as an additional input, we extend equation 8.1, obtaining:

s_{t+1} = f(s_t, u_t, y_t − y_t^d),
y_t = g(s_t).          (8.5)

The state transition s_{t+1} is a mapping from the previous state s_t, external influences u_t, and a comparison between the model output y_t and the observed data y_t^d. If the model error (y_t − y_t^d) is zero, we have a perfect description of the dynamics. However, due to unknown external influences or noise, our knowledge about the dynamics is often incomplete. Under such conditions, the model error (y_t − y_t^d) quantifies the model's misfit and serves as an indicator of short-term effects or external shocks (Zimmermann et al., 2002b). Using weight matrices A, B, C, and D of appropriate dimensions, corresponding to s_t, u_t, and (y_t − y_t^d), and a bias c, a neural network approach to 8.5 can be written


as:

s_{t+1} = tanh(A s_t + c + B u_t + D tanh(C s_t − y_t^d)),
y_t = C s_t.          (8.6)

In 8.6 the output y_t is computed by C s_t and compared to the observation y_t^d. The matrix D adjusts a possible difference in dimension between the error correction term and s_t. The system identification is now a parameter optimization task over appropriately sized weight matrices A, B, C, D, and the bias c (Zimmermann et al., 2002b):

∑_{t=1}^{T} (y_t − y_t^d)²  →  min_{A,B,C,D,c}          (8.7)

We solve the system identification task of 8.7 by finite unfolding in time using shared weights (see section 8.2.2). Figure 8.5 depicts the resulting neural network solution of 8.6.

Figure 8.5  Error correction neural network (ECNN) using unfolding in time and overshooting. Note that −Id is the fixed negative of an appropriately sized identity matrix, while z_τ with t − m ≤ τ ≤ t are output clusters with target values of zero in order to optimize the error correction mechanism.


The ECNN (eq. 8.6) is best understood by analyzing the dependencies of s_t, u_t, z_t = C s_t − y_t^d, and s_{t+1}. The ECNN has two different inputs: the externals u_t, which directly influence the state transition, and the targets y_t^d. Only the difference between y_t and y_t^d has an impact on s_{t+1} (Zimmermann et al., 2002b). At all future time steps t < τ ≤ t + n, we have no compensation of the internal expectations y_τ, and thus the system offers forecasts y_τ = C s_τ.

The autonomous part of the ECNN is, analogous to the RNN case (see section 8.2.3), extended into the future by overshooting. Besides all the advantages described in section 8.2.3, overshooting influences the learning of the ECNN in an additional way. A forecast provided by the ECNN is, in general, based both on a modeling of the recursive structure of the dynamical system (coded in the matrix A) and on the error correction mechanism, which acts as an external input (coded in C and D). The overshooting now enforces an autoregressive substructure allowing long-term forecasts. Of course, we have to supply target values for the additional output clusters y_τ, t < τ ≤ t + n. Due to the shared weights, there is no change in the number of model parameters (Zimmermann et al., 2002b).
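A sketch of one ECNN transition (eq. 8.6). In the past part the error correction term D tanh(C s − y^d) feeds the model's misfit back into the state; in the future part no observation y^d (and no external input u) is available, so the correction vanishes and the network produces pure forecasts. The shapes and the optional-argument convention are illustrative.

```python
import numpy as np

def ecnn_step(A, B, C, D, c, s, u=None, yd=None):
    """One step of eq. 8.6: s_next = tanh(A s + c + B u + D tanh(C s - y^d)).
    Omitting u and/or yd models the future part, where the corresponding
    terms are absent."""
    z = A @ s + c
    if u is not None:
        z = z + B @ u
    if yd is not None:
        z = z + D @ np.tanh(C @ s - yd)   # error correction input
    return np.tanh(z), C @ s              # next state and output y = C s

# Illustrative dimensions: 4 hidden states, 2 inputs, 1 output.
rng = np.random.default_rng(3)
n, k, p = 4, 2, 1
A = 0.5 * rng.standard_normal((n, n))
B, C, D = rng.standard_normal((n, k)), rng.standard_normal((p, n)), rng.standard_normal((n, p))
c = rng.standard_normal(n)
s, u = np.tanh(rng.standard_normal(n)), rng.standard_normal(k)

s_plain, y = ecnn_step(A, B, C, D, c, s, u=u)               # no correction term
s_zero_err, _ = ecnn_step(A, B, C, D, c, s, u=u, yd=C @ s)  # zero model error
```

When the model output matches the observation exactly (y^d = C s), the correction term is zero and the step reduces to the plain recurrent transition, mirroring the remark that a zero model error indicates a perfect description of the dynamics.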

8.3  Dynamical Consistent Neural Networks (DCNN)

The neural networks described in section 8.2 not only learn from data, but also integrate prior knowledge and first principles into the modeling in the form of architectural concepts. However, the question arises whether the outlined neural networks are a sufficient framework for the modeling of complex nonlinear dynamical systems, which can only be understood by analyzing the interrelationship of different subdynamics. Consider the following economic example: The dynamics of the US dollar–euro foreign exchange market is clearly influenced by the development of other major foreign exchange, stock, or commodity markets (Murphy, 1999). In other words, movements of the US dollar–euro foreign exchange rate can only be comprehended by a combined analysis of the behavior of other coherent markets. This means that a model of the US dollar–euro foreign exchange market must also learn the dynamics of related markets and intermarket dependencies. It is important to note that, due to their limited computational power (in the sense of modeling high-dimensional nonlinear dynamics), the described medium-sized recurrent neural networks are only capable of modeling a single market's dynamics. From this point of view an integrated approach to market modeling is hardly possible within the framework of those networks. Hence, we need large neural networks.

A simple scaling up of the presented neural networks would be misleading. Our experiments indicate that scaling up the networks by increasing the dimension of the internal state results in overfitting, due to the large number of free parameters. Overfitting is a critical issue, because the neural network then learns not only the underlying dynamics but also the noise included in the data. Especially in economic applications, overfitting poses a serious problem.

In this section we deal with architectures which are feasible for large recurrent neural networks.
These architectures are based on a redesign of the recurrent neural networks introduced in section 8.2. Most of the resulting networks cannot even be designed with a low-dimensional internal state (see section 8.3.1). In addition, we focus on a consistency problem of traditional statistical modeling: typically one assumes that the environment of the system remains unchanged when the dynamics is iterated into the future. We show that this is a questionable statistical assumption, and solve the problem with a dynamical consistent recurrent neural network (see section 8.3.2). Thereafter, we deal with large error correction networks and integrate dynamical consistency into this framework (see section 8.3.3). We then point out that large RNNs and large ECNNs are appropriate for different types of dynamical systems. Our intention is to merge the different characteristics of the two models in a unified neural network architecture, which we call DCNN, for dynamical consistent neural network (see section 8.3.4). Finally, we discuss the problem of


partially known observables (see section 8.3.5).

8.3.1  Normalization of Recurrent Networks

Let us revisit the basic time-delay recurrent neural network of 8.3. The state transition equation for s_t is a nonlinear combination of the previous state s_{t−1} and the external influences u_t, using the matrices A and B. The network output y_t is computed from the present state s_t employing the matrix C. The network output is therefore a nonlinear composition applying the transformations A, B, and C.

In preparation for the development of large networks we first separate the state equation of the recurrent network (eq. 8.3) into a past and a future part. In this framework s_t is always regarded as the present time state. That means that for a pattern t all states s_τ with τ ≤ t belong to the past part and those with τ > t to the future part. The parameter τ is hereby always bounded by the length of the unfolding m and the length of the overshooting n (see sections 8.2.2 and 8.2.3), such that we have τ ∈ {t − m, . . . , t + n} for all t ∈ {m, . . . , T − n}, with T the available number of data patterns. The present time (τ = t) is included in the past part, as these state transitions share the same characteristics. We get the following representation of the optimization problem:

τ ≤ t:   s_{τ+1} = tanh(A s_τ + c + B u_τ)
τ > t:   s_{τ+1} = tanh(A s_τ + c)
         y_τ = C s_τ,          (8.8)

∑_{t=m}^{T−n} ∑_{τ=t−m}^{t+n} (y_τ − y_τ^d)²  →  min_{A,B,C,c}

As shown in section 8.2, these equations can easily be transformed into a neural network architecture (see fig. 8.4). In this model, past and future iterations are consistent under the assumption of a constant future environment.

The difficulty with this kind of recurrent neural network is the training with backpropagation through time, because a sequence of different connectors has to be balanced. The gradient computation is not regular, i.e., we do not have the same learning behavior for the weight matrices in different time steps. In our experiments we found that this problem becomes more important when training large recurrent neural networks. Even the training itself is unstable due to the concatenated matrices A, B, and C. As training changes the weights in all of these matrices, different effects or tendencies, even opposing ones, can influence them and may superpose. This implies that no clear learning direction or weight changes result from a certain backpropagated error.

The question arises of how to redesign the basic recurrent architecture (eq. 8.8) to improve learning behavior and stability, especially for large networks. As a solution, we propose the neural network of 8.9, which, besides the bias c, incorporates only one connector, the matrix A. The corresponding architecture is depicted in fig. 8.6. Note that from now on we change the formulation of the system

Modeling Large Dynamical Systems with Dynamical Consistent Neural Networks

Figure 8.6  Normalized recurrent neural network. (Unfolded network with the shared transition matrix A, the bias c, and the fixed connectors [Id 0 0] and [0 0 Id]T.)

equations (e.g., eq. 8.8) from a forward (st+1 = f(st, ut)) to a backward formulation (st = f(st−1, ut)). As we will see, the backward formulation is internally equivalent to a forward model.

    τ ≤ t :   sτ = tanh(A sτ−1 + c + [0; 0; Id] uτ)
    τ > t :   sτ = tanh(A sτ−1 + c)
              yτ = [Id 0 0] sτ

    Σ_{t=m}^{T−n} Σ_{τ=t−m}^{t+n} (yτ − yτd)²  →  min_{A,c}            (8.9)

We call this model a normalized recurrent neural network (NRNN). It avoids the stability and learning problems resulting from the concatenation of the three matrices A, B, and C; the modeling is now focused solely on the transition matrix A. The matrices between the input and hidden layers as well as between the hidden and output layers are fixed and therefore not learned during training. This implies that all free parameters, combined as they are in a single matrix, are treated the same way by backpropagation. It is important to note that this normalization, the concentration on one single matrix, is paid for with an oversized (high-dimensional) internal state. At first glance it seems that in this network architecture (fig. 8.6) the external input uτ is directly connected to the corresponding output yτ. This is not the case, though, because we increase the dimension of the internal state sτ such that the input uτ has no direct influence on the output yτ. Assuming p network outputs, q computational hidden neurons, and r external inputs, the dimension of the internal state must satisfy dim(s) ≥ p + q + r. With the matrix [Id 0 0] we connect only the first p neurons of the internal state sτ to the output layer yτ. This connector is a fixed identity matrix of appropriate size. Consequently, the neural network is forced to generate the p outputs of the network at the first p components of the state vector sτ. Let us now focus on the last r state neurons, which are used for the processing


of the external inputs uτ. The connector [0 0 Id]T between the externals uτ and the internal state sτ is an appropriately sized fixed identity matrix. More precisely, the connector is designed such that the input uτ is connected to the last r state neurons. Recalling that the network outputs are located at the first p internal states, this composition avoids a direct connection between input and output: it delays the impact of the externals uτ on the outputs yτ by at least one time step. To further support the internal processing and to increase the network's computational power, we add q hidden neurons between the first p and the last r state neurons. This composition ensures that the input and output processing of the network are separate. Besides the bias vector c, the state transition matrix A holds the only tunable parameters of the system. Matrix A codes not only the autonomous and the externally driven parts of the dynamics, but also the processing of the external inputs uτ and the computation of the network outputs yτ. The bias added to the internal state handles offsets in the input variables uτ. Remarkably, the normalized recurrent network of eq. 8.9 can only be designed as a large neural network. If the internal network state is too small, the inputs and outputs cannot be separated, as the external inputs would at least partially cover the internal states at which the outputs are read out. The identification of the network outputs at the first p internal states would thus become impossible. Our experiments indicate that recurrent neural networks in which the only tunable parameters are located in a single state transition matrix (e.g., eq. 8.9) show a more stable training process, even if the dimension of the internal state is very large. Having trained the large network to convergence, many weights of the state transition matrix will be dispensable without impairing the functioning of the network.
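For concreteness, a single state update of the normalized recurrent network (eq. 8.9) can be sketched as follows; the dimensions p, q, r and all weight values are illustrative assumptions, not taken from the text.

```python
import numpy as np

p, q, r = 2, 4, 3                       # outputs, computational hidden neurons, inputs
d = p + q + r                           # dimension of the oversized internal state

rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((d, d))   # the only trainable connector (besides the bias c)
c = np.zeros(d)

# Fixed identity connectors: output readout [Id 0 0] and input feed [0 0 Id]^T.
C_out = np.hstack([np.eye(p), np.zeros((p, q + r))])
B_in  = np.vstack([np.zeros((p + q, r)), np.eye(r)])

def step(s_prev, u=None):
    """One NRNN transition (eq. 8.9): past step if u is given, future step otherwise."""
    pre = A @ s_prev + c
    if u is not None:                   # tau <= t: input enters only the last r states
        pre = pre + B_in @ u
    return np.tanh(pre)

s = step(np.zeros(d), u=np.ones(r))     # one past step with an external input
y = C_out @ s                           # output read from the first p state neurons

# The input cannot reach the output within the same time step:
s_no_u = step(np.zeros(d))
assert np.allclose(C_out @ s, C_out @ s_no_u)
```

The final assertion illustrates the separation argued above: since the input feed is zero in the first p + q rows, uτ only touches the last r components of the pre-activation and therefore cannot influence yτ directly.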
Unneeded weights can be singled out with a weight decay penalty and standard pruning techniques (Haykin, 1994; Neuneier and Zimmermann, 1998). In the normalized recurrent neural network (eq. 8.9) we treat inputs and outputs as separate quantities. This distinction between the externals uτ and the network output yτ is arbitrary and depends mainly on the application or on the view of the model builder rather than on the real underlying dynamical system. For the following model we therefore take a different point of view: we merge inputs and targets into one group of variables, which we call observables. We now look at the model as a high-dimensional dynamical system in which input and output together represent the observable variables of the environment. The hidden units stand for the unobservable part of the environment, which can nevertheless be reconstructed from the observations. This is an integrated view of the dynamical system. We implement this approach by replacing the externals uτ with the (observable) targets yτd in the normalized recurrent network. Consequently, the output yτ and the external input yτd now have identical dimensions.

    τ ≤ t :   sτ = tanh(A sτ−1 + c + [0; 0; Id] yτd)
    τ > t :   sτ = tanh(A sτ−1 + c)
              yτ = [Id 0 0] sτ

    Σ_{t=m}^{T−n} Σ_{τ=t−m}^{t+n} (yτ − yτd)²  →  min_{A,c}            (8.10)

The corresponding model architecture is shown in fig. 8.7.

Figure 8.7  Normalized recurrent net modeling the dynamics of observables yτd. (Same architecture as fig. 8.6, with the external inputs uτ replaced by the observed targets yτd.)

Note that, because of the one-step time delay between input and output, yτd and yτ are not directly connected. Furthermore, it is important to understand that we now take a totally different view of the dynamical system. In contrast to eq. 8.9, this network (eq. 8.10) generates forecasts not only for the dynamics of interest but for all external observables yτd. Consequently, the first r state neurons are used for the identification of the network outputs. They are followed by q computational hidden neurons and by r state neurons that read in the external inputs.

8.3.2  Dynamical Consistent Recurrent Neural Networks (DCRNN)

The models presented so far are all statistically but not dynamically consistent, as we assumed that the environment stays constant over the future part of the network. In the following we equip our models with dynamical consistency. An open dynamical system is driven partially by an autonomous development and partially by external influences. When the dynamics is iterated into the future, the development of the system environment is unknown. One of the standard statistical paradigms is to assume that the external influences do not change significantly in the future part, i.e., that the expected value of a shift in an external input yτd with τ > t is 0 by definition. For that reason we have so far

neglected the external inputs yτd in the normalized recurrent neural network at all future unfolding time steps τ > t (see eq. 8.10). Especially when we consider fast-changing external variables with a high impact on the dynamics of interest, this assumption is very questionable. In relation to eq. 8.10 it even poses a contradiction, as the observables are assumed to be constant on the input side and variable on the output side. Even in the case of a slowly changing environment, long-term forecasts become doubtful: the longer the forecast horizon, the more the statistical assumption is violated. A statistical model is therefore not consistent from a dynamical point of view. For a dynamically consistent approach, one has to integrate assumptions about the future development of the environment into the modeling of the dynamics. We therefore propose a network that uses its own predictions as replacements for the unknown future observables. This is expressed by an additional fixed matrix in the state equation. The resulting DCRNN is

    τ ≤ t :   sτ = [Id 0 0; 0 Id 0; 0 0 0] tanh(A sτ−1 + c) + [0; 0; Id] yτd
    τ > t :   sτ = [Id 0 0; 0 Id 0; Id 0 0] tanh(A sτ−1 + c)
              yτ = [Id 0 0] sτ

    Σ_{t=m}^{T−n} Σ_{τ=t−m}^{t+n} (yτ − yτd)²  →  min_{A,c}            (8.11)

Similarly to the end of section 8.3.1, we look at the state vector sτ in a very structured way. The recursion of the state equations (eq. 8.11) always acts on the same partitioning of that vector, in the past (τ ≤ t) as well as in the future (τ > t). For all τ ∈ {t − m, . . . , t + n}, sτ can be described as

    sτ = [ yτ ; hτ ; yτd (τ ≤ t) / yτ (τ > t) ]
       = [ expectations ; hidden states ; observations (τ ≤ t) / expectations (τ > t) ]        (8.12)

This means that in the first r components of the state vector we have the expectations yτ, i.e., the predictions of the model. The q components in the middle of the vector represent the hidden units hτ. They are actually responsible for the development of the dynamics. In the last r components of the vector we find in the past (τ ≤ t) the observables yτd, which the model receives as external input. In the future (τ > t) the model replaces these unknown future observables by its own expectations yτ. This replacement is modeled with two consistency matrices:

    C≤ = [Id 0 0; 0 Id 0; 0 0 0]    and    C> = [Id 0 0; 0 Id 0; Id 0 0].        (8.13)
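A minimal numerical sketch of one past and one future DCRNN transition (eqs. 8.11 and 8.13); the block sizes r, q and the random weights are illustrative assumptions.

```python
import numpy as np

r, q = 2, 4                 # observables and hidden neurons
d = 2 * r + q               # state layout: [expectations | hidden | observables]

def block(rows):
    """Assemble a consistency matrix from r/q-sized identity and zero blocks."""
    sizes = [r, q, r]
    return np.block([[np.eye(sizes[j]) if v == 1 else np.zeros((sizes[i], sizes[j]))
                      for j, v in enumerate(row)] for i, row in enumerate(rows)])

C_past   = block([(1, 0, 0), (0, 1, 0), (0, 0, 0)])   # C<=: deletes the last r states
C_future = block([(1, 0, 0), (0, 1, 0), (1, 0, 0)])   # C>:  copies expectations down

rng = np.random.default_rng(1)
A, c = 0.1 * rng.standard_normal((d, d)), np.zeros(d)

def past_step(s, y_obs):
    # tau <= t: keep expectations and hidden states, append the observation
    return C_past @ np.tanh(A @ s + c) + np.concatenate([np.zeros(r + q), y_obs])

def future_step(s):
    # tau > t: the model feeds its own expectations back in place of observations
    return C_future @ np.tanh(A @ s + c)

s = past_step(np.zeros(d), y_obs=np.ones(r))
s = future_step(s)
assert np.allclose(s[:r], s[-r:])   # dynamical consistency: last r states = expectations
```

The final assertion checks exactly the property the consistency matrix C> is designed for: after a future step, the observation slots of the state hold the model's own expectations.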

Let us explain one recursion of the state equation (eq. 8.11) in detail. In the past (τ ≤ t) we start with a state vector sτ−1 which has the structure of eq. 8.12. This vector is first multiplied by the transition matrix A. After adding the bias c, the vector is sent through the tanh nonlinearity. The consistency matrix C≤ then keeps the first r + q components of the state vector (expectations and hidden states) but deletes, by multiplication with zero, the last r components. These are finally replaced by the observables yτd, such that sτ again has the partitioning of eq. 8.12. Note that, in contrast to the normalized recurrent neural network (eq. 8.10), the observables are now added to the state vector after the nonlinearity. This is important for the consistency structure of the model. The recursion in the future (τ > t) differs from the one in the past in the structure of the consistency matrix and in the missing external input. The latter is replaced by an additional identity block in the future consistency matrix C>, which maps the first r components of the state vector, the expectations yτ, to its last r components. Thus we get the desired partitioning of sτ (eq. 8.12) and the model becomes dynamically consistent. Figure 8.8 illustrates this architecture. Note that the nonlinearity and the final calculation of the state vector are separate and hence modeled in two different layers. This follows from the dynamically consistent state equation (eq. 8.11), in which the observables are added separately from the nonlinear component. Regarding the single transition matrix A, we want to point out that in a statistically consistent recurrent network (eq. 8.10) the matrix has to model both the state transformation over time and the merging of the input information. However, the

Figure 8.8  Dynamical consistent recurrent neural network (DCRNN). At all future time steps of the unfolding the network uses its own forecasts as substitutes for the unknown development of the environment.

network is only triggered by the external drivers up to the present time step t. In a dynamically consistent network we have forecasts of the external influences, which can be used as future inputs. Thus, the transition matrix A is always dedicated to the same task: modeling the dynamics.

8.3.3  Dynamical Consistent Error Correction NNs (DCECNN)

The ECNN is a nonlinear state space model employing the shared weight matrices A, B, C, and D (eq. 8.6). Matrix A computes the state transformation over time, B processes the external input information, C derives the network output, and D is responsible for the error correction mechanism. This composition of nonlinear transformations A, B, C, and D is difficult to handle when the network's internal state is high dimensional. Therefore we developed a dynamical consistent error correction neural network (DCECNN) of the form of eq. 8.14. It is an approach analogous to the DCRNN (eq. 8.11), and consequently the equations are very similar. The only changes concern the two consistency matrices C≤ and C>.

    τ ≤ t :   sτ = [Id 0 0; 0 Id 0; −Id 0 0] tanh(A sτ−1 + c) + [0; 0; Id] yτd
    τ > t :   sτ = [Id 0 0; 0 Id 0; 0 0 0] tanh(A sτ−1 + c)
              yτ = [Id 0 0] sτ

    Σ_{t=m}^{T−n} Σ_{τ=t−m}^{t+n} (yτ − yτd)²  →  min_{A,c}            (8.14)

Due to the error correction, the definition or partitioning of the state vector differs in the last r components. For all τ ∈ {t − m, . . . , t + n} we now have

    sτ = [ yτ ; hτ ; eτ (τ ≤ t) / 0 (τ > t) ]
       = [ expectations ; hidden states ; error correction (τ ≤ t) / 0 (τ > t) ]        (8.15)

In the past part (τ ≤ t) we obtain the error correction term in the state vector by subtracting the expectations yτ from the observations yτd. This is performed by the negative identity matrix −Id within the consistency matrix C≤. In the future part (τ > t) we expect that our model is correct and therefore replace the error correction by zero: the future consistency matrix C> simply overwrites the last r


components of the state vector with zero. Analogous to the DCRNN, the internal transition matrix A is used only for the modeling of the dynamics over time. The graphical illustration of a dynamical consistent error correction neural network is identical to the recurrent one (fig. 8.8), but note that the consistency matrices C≤ and C> have a different structure.

8.3.4  Dynamical Consistent Neural Networks (DCNN)

Dynamical consistent recurrent neural networks (see section 8.3.2) are most appropriate if the observed dynamics is not hidden by noise and evolves smoothly over time, e.g., the modeling of a sine curve. However, such a modeling can only be successful if we know all external drivers of the system and the dynamics is not influenced by external shocks. In many real-world applications, e.g., trading (see Zimmermann et al. (2002a)), this is simply not true: the dynamics of interest is often covered by noise, and external shocks or unknown external influences disturb the system dynamics. In this case one should apply DCECNNs (see section 8.3.3), which describe the dynamics with an internal expectation and its deviation from the observables. Now the question arises of whether and how we can merge the different model characteristics within a single dynamical consistent neural network (DCNN). There are two ways to set up this combination. Our first approach (eq. 8.18) keeps the framework of the DCRNN (eq. 8.11), whereas the second one (eq. 8.22) is based on the DCECNN (eq. 8.14).

The first approach is based on the DCRNN. Consequently, the state vector sτ has, in the past (τ ≤ t) and future (τ > t), for all τ ∈ {t − m, . . . , t + n}, the partitioning of eq. 8.16 (see also eq. 8.12):

    sτ = [ yτ ; hτ ; yτd (τ ≤ t) / yτ (τ > t) ]
       = [ expectations ; hidden states ; observations (τ ≤ t) / expectations (τ > t) ]        (8.16)

In comparison to the DCRNN (eq. 8.11), the recursion of the new model (eq. 8.18) is extended by an additional consistency matrix

    C = [0 0 Id; 0 Id 0; −Id 0 Id]                (8.17)

between the state vector and the transition matrix A. As we will see, this matrix ensures that the model is supplied with the information of the observables yτd as well as with the error corrections eτ. We call this approach DCNN1 (eq. 8.18). The corresponding network architecture is depicted in fig. 8.9.

    τ ≤ t :   sτ = [Id 0 0; 0 Id 0; 0 0 0] tanh(A [0 0 Id; 0 Id 0; −Id 0 Id] sτ−1 + c) + [0; 0; Id] yτd
    τ > t :   sτ = [Id 0 0; 0 Id 0; Id 0 0] tanh(A [0 0 Id; 0 Id 0; −Id 0 Id] sτ−1 + c)
              yτ = [Id 0 0] sτ

    Σ_{t=m}^{T−n} Σ_{τ=t−m}^{t+n} (yτ − yτd)²  →  min_{A,c}            (8.18)

To describe how the model evolves, we explain the state equations step by step. We start with a state vector sτ−1 which has the structure of eq. 8.16. Through the multiplication with the additional consistency matrix C, the state vector is transformed into a vector with the partitioning

    s̃τ = [ yτd ; hτ ; eτ ] = [ observations ; hidden states ; error correction ]        (8.19)

for all τ ∈ {t − m, . . . , t + n}. This inner state vector s̃τ contains the observables and the error correction and thus combines the ideas of the DCRNN and the DCECNN. The rest of the recursion is identical with the DCRNN (eq. 8.11). As before, the only learnable parameters of the network are located in the matrix A and the bias c.
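The effect of the additional consistency matrix C (eq. 8.17) can be checked numerically; the block sizes and the concrete vectors below are illustrative assumptions.

```python
import numpy as np

r, q = 2, 3                              # observables and hidden neurons

# Inner consistency matrix C of eq. 8.17: maps [y; h; y^d] to [y^d; h; y^d - y].
C = np.block([
    [np.zeros((r, r)), np.zeros((r, q)), np.eye(r)],
    [np.zeros((q, r)), np.eye(q),        np.zeros((q, r))],
    [-np.eye(r),       np.zeros((r, q)), np.eye(r)],
])

y   = np.array([0.3, -0.1])              # expectations
h   = np.array([0.5, 0.2, -0.4])         # hidden states
y_d = np.array([0.25, 0.0])              # observations

s = np.concatenate([y, h, y_d])          # past-state partitioning of eq. 8.16
s_inner = C @ s                          # inner state vector of eq. 8.19

assert np.allclose(s_inner[:r], y_d)     # observations on top
assert np.allclose(s_inner[r:r+q], h)    # hidden states unchanged
assert np.allclose(s_inner[-r:], y_d - y)  # error correction e = y^d - y
```

The transition matrix A therefore always acts on the same inner partitioning [observations; hidden states; error correction], which is what makes the two DCNN variants below equivalent in their modeling of the dynamics.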

Figure 8.9  Dynamical consistent neural network (DCNN).

As already mentioned, the second approach to a dynamical consistent neural network (DCNN2) is based on the DCECNN model (eq. 8.14). The state vector sτ assumes the corresponding structure (see eq. 8.15):

    sτ = [ yτ ; hτ ; eτ (τ ≤ t) / 0 (τ > t) ]
       = [ expectations ; hidden states ; error correction (τ ≤ t) / 0 (τ > t) ]        (8.20)

Analogous to the development of DCNN1 (eq. 8.18), the DCECNN equation is extended by an additional consistency matrix C, which now has the structure

    C = [Id 0 Id; 0 Id 0; 0 0 Id].                (8.21)

The resulting DCNN2 can be described with the following set of equations:

    τ ≤ t :   sτ = [Id 0 0; 0 Id 0; −Id 0 0] tanh(A [Id 0 Id; 0 Id 0; 0 0 Id] sτ−1 + c) + [0; 0; Id] yτd
    τ > t :   sτ = [Id 0 0; 0 Id 0; 0 0 0] tanh(A [Id 0 Id; 0 Id 0; 0 0 Id] sτ−1 + c)
              yτ = [Id 0 0] sτ

    Σ_{t=m}^{T−n} Σ_{τ=t−m}^{t+n} (yτ − yτd)²  →  min_{A,c}            (8.22)

Looking at the multiplication C · sτ−1, we can easily confirm that, supposing sτ−1 is structured as in eq. 8.20, we once again obtain an inner state vector s̃τ partitioned as in eq. 8.19. This implies that the transition matrix A is applied in both models to the same inner state vector s̃τ. Consequently, although the two models look quite different, they share an identical modeling of the dynamics. Which approach is preferable may depend on additional modeling tools or on the particular application. The network architecture of the alternative approach, DCNN2, is identical to that of DCNN1 (fig. 8.9), but note that the consistency matrices C, C≤, and C> differ. In contrast to the DCRNN (eq. 8.11) and the DCECNN (eq. 8.14), the two DCNN approaches (eqs. 8.18 and 8.22) compute the state trajectory of the dynamics in the past exactly on the observed path. This follows from the partitioning of the inner state vector s̃τ (eq. 8.19), which is responsible for the calculation of the dynamics: it contains the observables in the first r components, which are directly used to determine the prediction yτ, while the error corrections, now located in the last r components, act as additional inputs. Furthermore, the DCNN offers an

interesting new insight into the observation of dynamical systems. Typically, small movements of the dynamics are treated as noise, and the modeling thus focuses on larger shifts in the dynamics. Our view is different: we believe that small system changes characterize the autonomous part of our open system, while the large swings originate at least partially from the external forces. If we neglect small system changes, we also suppress valuable substructure in our observations. We found that DCNNs allow us to model even small changes in the dynamics without losing the generalization abilities of the model. This introduces a new perspective on the structure/noise dilemma in modeling dynamical systems.

8.3.5  Partially Known Observables

So far our models have always distinguished between a past and a future development of the state equation. We assumed that in the past part (τ ≤ t) all the identified observables are available. In the future part (τ > t) we accepted that we do not know anything about the observables and hence replaced them by the model's own expectations. In many practical applications, however, we have observables which are not available for all time steps in the past; conversely, one might have observables which are also available in the future, e.g., calendar data. In the following we therefore switch from a model differentiating between past and future to a modeling structure which distinguishes between available and missing external inputs. The DCNN with partially known observables merges the two state equations of the DCNN (e.g., eq. 8.22) into one single equation that allows us to differentiate between available and unavailable observables. Consequently, it is a reformulation of the normal DCNN providing an easier and more general structure. The simplification into one equation also makes the model more tractable for further discussions (see sections 8.4 and 8.5). The following model (eq. 8.23) is based on the DCNN2 (eq. 8.22), but an analogous model can easily be created for DCNN1 (eq. 8.18) as well. For all τ ∈ {t − m, . . . , t + n}, we have

    sτ = [Id 0 0; 0 Id 0; E 0 0] tanh(A [Id 0 Id; 0 Id 0; 0 0 Id] sτ−1 + c) + [0; 0; Id] yτE
    yτ = [Id 0 0] sτ

    Σ_{t=m}^{T−n} Σ_{τ=t−m}^{t+n} (yτ − yτd)²  →  min_{A,c}            (8.23)

In this model the external inputs yτE and the matrix E are defined as follows:

    yτE := { 0     if input missing
           { yτd   if input available              (8.24)

and

    E := { 0     if input missing
         { −Id   if input available                (8.25)

It is important to note that the inner consistency matrix C is independent of the input availability. We only adapt the outer consistency matrix:

    CE = [Id 0 0; 0 Id 0; E 0 0].                  (8.26)

The structure guarantees that an error correction is calculated in the last r state components if external input is available. Thus, we have a time-independent combination of the former two state equations (eq. 8.22). The corresponding model architecture (fig. 8.10) does not change significantly in comparison to the former (time-oriented) DCNN (fig. 8.9).
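The availability-dependent matrix E of eqs. 8.24 to 8.26 can be sketched as follows; the dimensions and the concrete vectors are illustrative assumptions.

```python
import numpy as np

r, q = 2, 3                                    # observables, hidden neurons

def outer_consistency(available: bool):
    """C^E of eq. 8.26: third block row is E = -Id if the input is available, else 0."""
    E = -np.eye(r) if available else np.zeros((r, r))
    return np.block([[np.eye(r),        np.zeros((r, q)), np.zeros((r, r))],
                     [np.zeros((q, r)), np.eye(q),        np.zeros((q, r))],
                     [E,                np.zeros((r, q)), np.zeros((r, r))]])

def external_input(y_d):
    """y^E of eq. 8.24: the observation if available, zero otherwise."""
    return y_d if y_d is not None else np.zeros(r)

y = np.array([0.4, -0.2])                      # expectations (first r of the tanh output)
h = np.array([0.1, 0.0, 0.3])                  # hidden part of the tanh output
z = np.concatenate([y, h, np.zeros(r)])        # stand-in for tanh(A C s + c)

y_d = np.array([0.5, -0.1])                    # an available observation
s = outer_consistency(True) @ z + np.concatenate([np.zeros(r + q), external_input(y_d)])
assert np.allclose(s[-r:], y_d - y)            # error correction when input is available

s = outer_consistency(False) @ z + np.concatenate([np.zeros(r + q), external_input(None)])
assert np.allclose(s[-r:], 0)                  # zero, as in the future part of eq. 8.22
```

With an available input the last r state components carry the error correction yτd − yτ; with a missing input they are zero, reproducing the former future state equation without any reference to the time step.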

Figure 8.10  DCNN with partially known observables.

The DCNN with partially known observables is more general in the sense of observable availability and hence better applicable to real-world problems. The following discussions are mainly based on this model.

8.4  Handling Uncertainty

In practical applications our models have to cope with several forms of uncertainty. So far we have neglected their possible influence on the generalization performance. Uncertainty can disturb the development of the internal dynamics and seriously harm the quality of our forecasts. In this section we present several methods which reduce the model's dependency on uncertain data. There are three major sources of uncertainty. First, the input data itself might be corrupted or noisy; we deal with that problem in section 8.4.1. Second, in the framework of recurrent neural networks finitely unfolded in time, we also have


the uncertainty of the initial state. We present different approaches to overcome that uncertainty and to desensitize the model from the unknown initialization (see section 8.4.2). Finally, we discuss the uncertainty of the future inputs and question once more the assumption of a constant environment (see section 8.4.3).

8.4.1  Handling Data Noise

So far we have always assumed our input data to be correct. In most practical applications this is not true. In the following we present an approach which tries to minimize input uncertainty. Cleaning noise is a method which improves the model's learning behavior by correcting corrupted or noisy input data. It is an enhancement of the cleaning technique described in detail in Neuneier and Zimmermann (1998). In short, cleaning considers the inputs as corrupted and adds corrections to them where necessary. However, we want to keep the cleaning correction as small as possible. This leads to the extended error function

    Et^{y,x} = 1/2 [(yt − ytd)² + (xt − xtd)²] = Et^y + Et^x   →   min_{xt, w}.        (8.27)

Note that this new error function does not change the usual weight adaptation rule

    w+ = w − η ∂Et^y/∂w,                           (8.28)

where η > 0 is the so-called learning rate and w+ stands for the adapted weight. To calculate the cleaned input

    xt = xtd + ρt,                                 (8.29)

we need the correction vectors ρt for all input data of the training set. The update rule for these corrections, initialized with ρt = 0, can be derived from the typical adaptation sequence

    xt+ = xt − η ∂Et^{y,x}/∂x,                     (8.30)

leading to

    ρt+ = (1 − η) ρt − η ∂Et^y/∂x.                 (8.31)
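A minimal sketch of the cleaning updates (eqs. 8.28 to 8.31) for a linear toy model; the model, the data, and the learning rate are illustrative assumptions, not the chapter's network.

```python
import numpy as np

rng = np.random.default_rng(0)
T, r = 50, 3
x_d = rng.standard_normal((T, r))                   # observed, possibly corrupted inputs
y_d = x_d @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(T)

w   = rng.standard_normal(r)                        # weights of a toy linear model y = x @ w
rho = np.zeros((T, r))                              # cleaning corrections, initialized to 0
eta = 0.05                                          # learning rate

for _ in range(200):
    x   = x_d + rho                                 # cleaned inputs (eq. 8.29)
    err = x @ w - y_d                               # residuals; E_t^y = 1/2 err_t^2
    w   = w - eta * (err @ x) / T                   # usual weight adaptation (eq. 8.28)
    rho = (1 - eta) * rho - eta * err[:, None] * w  # correction update (eq. 8.31)
```

One correction vector ρt is kept per training pattern; both updates reuse the residual error, so the cleaning step adds almost no computational cost on top of ordinary gradient training.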

This is a nonlinear version of the error-in-variables concept from statistics. We derive all the information needed, especially the residual error ∂Et^{y,x}/∂x, from training the network with backpropagation (fig. 8.18), which makes the computational effort negligible. It is important to note that in this way the corrections are performed by the model itself and not by applying external knowledge (see the "observer-observation dilemma" in Neuneier and Zimmermann, 1998). We now assume that the data is not only corrupted but also noisy. For that


reason we add an extra noise vector, −ρτ, to the cleaned value:

    xt = xtd + ρt − ρτ.                            (8.32)

The noise vector ρτ is a randomly chosen row vector {ρτi}i=1,...,r of the cleaning matrix

    CCleaning := ( ρti ),   t = 1, . . . , T (rows),   i = 1, . . . , r (columns),

which stores the input error corrections of all data patterns. The matrix has the same size as the pattern matrix: the number of rows equals the number of patterns T, and the number of columns equals the number of inputs r. One might wonder why we disturb the cleaned input xt = xtd + ρt with an additional noise term −ρτ. The reason is that we want to benefit from presenting the whole input distribution to the network instead of only one particular realization (Zimmermann and Neuneier, 1998).

A variation on the cleaning noise method is called local cleaning noise. Cleaning noise adds the same noise term −ρτ to every training pattern and therefore assumes that the noise of the different inputs is correlated. Especially in high-dimensional models it is improbable that all components of the input vector follow an identical or at least correlated noise distribution. For these cases we propose a method which differentiates component-wise:

    xti = xtdi + ρti − ρτi.                        (8.33)

In contrast to the normal cleaning technique, the local version corrects each component of the input vector individually, by a cleaning correction and a randomly taken entry ρτi of the corresponding column {ρti}t=1,...,T of the cleaning matrix CCleaning. A further advantage of the local cleaning technique is that, with the increased number of (local) correction terms (T · r), we can cover higher dimensions. With the normal cleaning technique, in contrast, the dimension is bounded by the number of training patterns T, which can be insufficient for high-dimensional problems.

8.4.2  Handling the Uncertainty of the Initial State

One of the difficulties with finite unfolding in time is to find a proper initialization for the first state vector of the recurrent neural network. An obvious solution is to set the first state s0 to zero. We then implicitly assume that the unfolding includes

enough (past) time steps such that the misspecification of the initialization phase is compensated along the state transitions. In other words, the network accumulates information over time and can thus eliminate the impact of the arbitrary initial state on the network outputs. The model can be improved if we make the unfolded recurrent network less sensitive to the unknown initial state s0. For this purpose we look for an initialization for which the interpretation of the state recursion is consistent over time. Since the initialization procedure is identical for all types of DCNNs, we demonstrate the approach on the DCNN with partially known observables (eq. 8.23):

    sτ = CE tanh(A · C · sτ−1 + c) + [0; 0; Id] yτE
    yτ = [Id 0 0] sτ

    Σ_{t=m}^{T−n} Σ_{τ=t−m}^{t+n} (yτ − yτd)²  →  min_{A,c}            (8.34)

In a first step we explicitly integrate a first state vector s0 (see fig. 8.11). This first state is no longer set to zero but receives the target information ytar_{t−1} of the first output y_{t−1}. The vector is then multiplied by the consistency matrix CE, such that the first r components of the first state vector st−1 coincide with the first expected output. This avoids the generation of an excessively large error for the first output.

Figure 8.11  Time consistent initialization of a DCNN with an additional initial state s0.

The hidden states of this model are arbitrarily initialized with zero. In a second step we add a noise term ε to the first state vector s0 to stiffen the model against the uncertainty of the unknown initial state. A fixed noise term ε drawn from a predetermined noise distribution is clearly inadequate to handle this uncertainty. Instead, following the cleaning noise method, we apply an adaptive noise term that best fits the volatility of the unknown initial state s0. As explained in section 8.4.1, the characteristics of the adaptive noise term are automatically determined as a by-product of the error backpropagation algorithm.


The basic idea is as follows. The residual error ρ as measured at the initial state s0 can be interpreted as the uncertainty stemming from missing information about the true initial state vector. If we disturb s0 with a noise term which follows the distribution of the residual error of the network, we diminish the uncertainty about the unknown initial state during system identification. In addition, this allows a better fitting of the target values over the training set. A corresponding network architecture is depicted in fig. 8.12.
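The adaptive start-noise idea can be sketched as follows: the hidden part of s0 is perturbed with noise drawn from the observed residual errors, without assuming any noise distribution. The dimensions and the residual pool are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
r, q = 2, 4
d = 2 * r + q                                   # state layout: [outputs | hidden | inputs]

# Pool of residual errors observed at s0 during backpropagation (stand-in values here).
residual_pool = 0.05 * rng.standard_normal((200, q))

def initial_state(y_target):
    """Build s0: target information at the first r components, residual-drawn noise on
    the hidden part, zeros at the input components (the incomplete identity matrix)."""
    s0 = np.zeros(d)
    s0[:r] = y_target                           # avoids a large error at the first output
    s0[r:r + q] = residual_pool[rng.integers(len(residual_pool))]
    return s0

s0 = initial_state(np.array([0.2, -0.3]))
```

Because the noise is resampled from the network's own residuals, its level shrinks as training improves, which is what lets the trajectory tube described below contract over time.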

Figure 8.12

Desensitization of a DCNN from the unknown initial state s0 .

Technically, noise is introduced into the model via an additional input layer. The dimension of the noise is equal to that of the internal state. The input values are fixed at zero over time. Due to the incomplete identity matrix between the noise and the initial state, the noise is applied only to the hidden values of the initial state, where no input information is available. The desensitization of the network from the initial state vector s0 can therefore be seen as a self-scaling stabilizer of the modeling. Note that the noise term ρ is drawn randomly from the observed residual errors, without any prior assumption on the underlying noise distribution.

In general, a discrete-time state trajectory forms a sequence of points over time. Such a trajectory is comparable to a thread in the internal state space. The trajectory is very sensitive to the initial state vector s0. If we apply noise to s0, the space of all possible trajectories becomes a tube in the internal state space (fig. 8.13). Due to the characteristics of the adaptive noise term, which decreases over time, the tube contracts. This enforces the identification of a stable dynamical system. Consequently, the finite-volume trajectories act as a regularization and stabilization of the dynamics.

The question arises of how best to create an appropriate noise level. Table 8.1 gives an overview of several initialization techniques we have developed and examined so far. Remember that in all cases the corrections are applied only to the hidden variables of the initial state s0. We already explained the first three methods in section 8.4.1. The idea behind the initialization with start noise is that we do not need a cleaning correction but can focus solely on the noise term. Double start noise tries to achieve a nearly symmetrical noise distribution, which is also double in comparison to normal start


Figure 8.13 Creating a tube in the internal state space by applying noise to the initial state.

Table 8.1 Overview of Initialization Techniques

Cleaning: s0 = 0 + ρt
Cleaning noise: s0 = 0 + ρt − ρτ
Local cleaning noise: s0,i = 0 + ρt,i − ρτ,i
Start noise: s0 = 0 + ρτ
Local start noise: s0,i = 0 + ρτ,i
Double start noise: s0 = 0 + (ρ1τ − ρ2τ)
Double local start noise: s0,i = 0 + (ρ1τ,i − ρ2τ,i)

noise. In all cases, local corresponds to the individual application of a noise term to each component of the initial state s0 (see local cleaning noise in section 8.4.1). From top to bottom, the methods listed in table 8.1 use less and less information about the training set. Hence double start noise places the most emphasis on the generalization abilities of the model. This is also confirmed by our experiments. Furthermore, we could confirm that the local initialization techniques lead to better performance in high-dimensional models (see section 8.4.1).
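The variants of table 8.1 can be sketched in code as follows (a hypothetical helper of our own, assuming the cleaning correction ρt and a bank of residuals observed over the training set are available; ρτ denotes a residual vector drawn at random, and the local variants draw each component independently):

```python
import numpy as np

def init_state(method, rho_t, residual_bank, rng):
    """Sketch of the initialization variants in table 8.1 (hidden part of s0).

    rho_t         : (d,) cleaning correction, i.e. the residual backpropagated
                    to the initial state for the current pattern
    residual_bank : (T, d) residuals observed over the training set, from
                    which the random noise terms rho_tau are drawn
    """
    d = rho_t.shape[0]
    draw = lambda: residual_bank[rng.integers(len(residual_bank))]       # rho_tau
    draw_local = lambda: residual_bank[                                  # rho_tau_i
        rng.integers(len(residual_bank), size=d), np.arange(d)]
    if method == "cleaning":                 return rho_t
    if method == "cleaning_noise":           return rho_t - draw()
    if method == "local_cleaning_noise":     return rho_t - draw_local()
    if method == "start_noise":              return draw()
    if method == "local_start_noise":        return draw_local()
    if method == "double_start_noise":       return draw() - draw()
    if method == "double_local_start_noise": return draw_local() - draw_local()
    raise ValueError(method)
```

Note how the methods lower in the table drop the pattern-specific correction ρt and rely only on draws from the residual bank, i.e., on progressively less information about the training set.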

8.4.3 Handling the Uncertainty of Unknown Future Inputs

In the past part of the network, the influence of the unknown externals is reflected in the error corrections as calculated by the backpropagation algorithm. In the future part we do not have any information about the correctness of our inputs. As explained in section 8.3, we either use our own forecasts as future inputs or simply assume that the inputs stay constant in the future. The underlying assumption is that the observables evolve in the future as they did in the past. We cannot verify whether this is correct; indeed, for most practical applications it is a very questionable assumption.


To stabilize our model against these uncertainties of the future inputs we apply a Gaussian noise term εt+τ to the last r components of each future state vector st+τ . The corresponding architecture is depicted in ﬁg. 8.14.

Figure 8.14 Handling the uncertainty of the future inputs by adding a noise term εt+τ to each future state vector st+τ.

The additional noise is used during the training of the model to achieve a more stable output. For the actual deterministic forecast we either skip the application of noise to avoid a disturbance of the predictions or average our results over a suﬃcient number of diﬀerent forecasts (Monte Carlo approach).
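A minimal sketch of this scheme might look as follows (our own illustration: the one-step transition is abstracted as a generic callable, and the noise level sigma is an arbitrary assumption). Training-style noise on the last r components can be switched off for a deterministic forecast, or averaged out over several noisy runs in the Monte Carlo variant:

```python
import numpy as np

def forecast(step, s_t, horizon, r, sigma=0.1, n_samples=0, rng=None):
    """Roll a state transition forward over the future part of the network.

    step      : callable s -> s_next (one state transition)
    r         : noise is applied to the last r components of each future state
    n_samples : 0  -> deterministic forecast (noise skipped);
                >0 -> Monte Carlo average over noisy runs
    """
    rng = np.random.default_rng() if rng is None else rng

    def one_run(noisy):
        s, outs = s_t.copy(), []
        for _ in range(horizon):
            s = step(s)
            if noisy:  # stabilizing Gaussian noise, as during training
                s[-r:] += rng.normal(0.0, sigma, size=r)
            outs.append(s.copy())
        return np.stack(outs)

    if n_samples == 0:
        return one_run(noisy=False)
    return np.mean([one_run(noisy=True) for _ in range(n_samples)], axis=0)
```

With `n_samples = 0` the noise is skipped, avoiding a disturbance of the predictions; with `n_samples > 0` the noisy forecasts are averaged (the Monte Carlo approach).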

8.5 Function and Structure in Recurrent Neural Networks

Our discussion of function and structure in recurrent neural networks is focused on the autonomous part of the model, which is mapped by the internal state transition matrix A. So far the transition matrix A has always been assumed to be fully connected. In a fully connected matrix the information of a state vector st is processed using the weights in A to compute st+1. This implies that there is a high proportion of superposition (computation) but hardly any conservation of information (memory) from one state to a succeeding one (see the right panel of fig. 8.15). For the identification of dynamical systems such memory can be essential, as information may be needed for computation in subsequent time steps. A shift register (see the left panel of fig. 8.15) is a simple example of the implementation of memory, as it only transports information within the state vector s. No superposition is performed in this transition matrix.

At first sight we have two contradictory functions: superposition and conservation of information. Superposition of information is necessary to generate or adapt changes of the dynamics. In contrast, conservation of information causes memory effects by transporting information more or less unmodified to a subsequent state neuron. In this context, memory can be defined as the average number of state transitions necessary to transmit information from one state neuron to any other

Figure 8.15 Function and structure in dynamical systems: computation versus memory in the transition matrix A.

one in a subsequent state. We call this number of necessary state transitions the path length of a neuron. To overcome the apparent dilemma between superposition and conservation of information, the transition matrix A needs a structure which balances memory and computation effects. Sparseness of the transition matrix reduces the number of paths and the computation effect of the network, but at the same time increases the average path length and therefore allows for longer-lasting memory. A possible solution is an inflation of the recurrent network, i.e., of the transition matrix A. We show that with such an inflation an optimal balance between memory and computation can be achieved (section 8.5.1). In this context we present conjectures about the optimal level of sparseness and the required minimum dimension. An experiment with artificial data underlines our results (section 8.5.2). In section 8.5.3 we conclude that sparseness is actually an essential condition for high-dimensional neural networks. Finally, we discuss in section 8.5.4 the information flow in sparse networks.

8.5.1 Inflation of Recurrent Neural Networks

Based on the length of the past unfolding m (see section 8.2.2) and the optimal state dimension dim(s) of a fully connected recurrent network, we can define a procedure for an optimal design of the neural network structure, which solves the dilemma between memory and computation. The idea is to inflate the network to a higher dimensionality, while maintaining the computational complexity of the former lower-dimensional and fully connected network, and at the same time allowing for memory effects. With an inflated transition matrix A we can optimize both superposition and conservation of information. To determine the optimal dimension and the level of sparseness, we propose two conjectures, which we will empirically investigate in section 8.5.2. In a first step we calculate the new dimension of the internal state s by

dim(snew) := m · dim(s).   (8.35)

As the former dimension of s was supposed to be optimal, we have to ensure that the higher-dimensional network has the same superposition of information as the original one. This can be achieved by keeping the number of active weights constant. On average we want to have the same number of nonzero elements as in the former lower-dimensional network. Thus, the sparseness level of the new matrix Anew is given by

initialize Anew with Random(dim(s) / dim(snew)) = Random(1/m).   (8.36)


Hereby Random(·) represents the percentage of randomly initialized weights, whereas the remaining weights are set to zero. Proceeding this way, we replicate on average the computation effect of the former network. At the same time we increase the path lengths (memory) with the sparseness level of the new transition matrix Anew. Note that the sparseness level depends only on the length of the past unfolding m. The conjecture (eq. 8.36) implies that the sparseness of Anew is generated randomly. In section 8.5.3 we present techniques which try to optimize the sparse structure and consequently the memory and computation abilities of the network.

Based on our conjectures about inflation, a proper training procedure for recurrent neural networks should consist of four steps: First, one has to set up an appropriate network architecture (e.g., DCNN, eq. 8.23). Second, the length of the past unfolding m and the optimal internal state dimension dim(s) of the system have to be estimated by analyzing the network errors along the time steps of the unfolding (see section 8.2.2). Third, we use the estimated parameters m and dim(s) to determine the optimal dimensionality and sparseness (eqs. 8.35 and 8.36). Fourth, the inflated network is trained until convergence by backpropagation through time using, e.g., the vario-eta learning rule (Neuneier and Zimmermann, 1998).
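The inflation step (eqs. 8.35 and 8.36) can be sketched as follows (an illustrative helper of our own; the Gaussian weight scale of 0.1 is an arbitrary assumption, not from the chapter):

```python
import numpy as np

def inflate_transition_matrix(dim_s, m, rng=None):
    """Random sparse initialization of the inflated network (eqs. 8.35, 8.36).

    dim_s : optimal state dimension of the fully connected network
    m     : length of the past unfolding
    """
    rng = np.random.default_rng() if rng is None else rng
    dim_new = m * dim_s                               # eq. 8.35
    mask = rng.random((dim_new, dim_new)) < 1.0 / m   # eq. 8.36: Random(1/m)
    weights = rng.normal(0.0, 0.1, size=(dim_new, dim_new))
    return np.where(mask, weights, 0.0)               # remaining weights are zero
```

For the experiment below (dim(s) = 5, m = 3) this yields a 15 × 15 matrix in which roughly one third of the entries is active.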

8.5.2 Experiments: Testing Conjectures About Inflation

In the following experiments we evaluate our conjectures about the optimal inflation of recurrent networks. To ensure a straightforward analysis of our proposed equations (eqs. 8.35 and 8.36), we modeled an artificial network which consists of an


autonomous development only. We applied the network to forecast, one time step ahead, the development of the following artificial data generation process:

st = tanh(A · st−m) + εt,   (8.37)

where dim(s) = 5, A is randomly initialized, m = 3, and εt is white noise with σ = 0.2. The unfolding in time of the recurrent network includes five time steps from t−3 to t+1. As the data generation process is a closed dynamical system, there are no inputs; instead, time-delayed states st−k (k = 1, . . . , m) are used as external influences. Each of the following experiments is based on 100 Monte Carlo simulation runs. For each run we generated 1000 observations, 25% for training and 75% for testing purposes. The network was trained until convergence with error backpropagation through time using vario-eta learning (Neuneier and Zimmermann, 1998). First we evaluated our conjecture about the sparse random initialization of the transition matrix Anew. For this purpose we randomly initialized matrix Anew with different levels of sparseness (100 test runs per sparseness degree). The dimension of the internal state was fixed according to eq. 8.35 at dim(snew) = 3 · 5 = 15 for all test runs. The mean square error of the network measured on the test set was used as a performance criterion.
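The data generation process of eq. 8.37 can be reproduced with a few lines (a sketch under the stated parameters; the standard-normal initialization of A and of the m start states is our own assumption):

```python
import numpy as np

def generate_series(n_obs, dim_s=5, m=3, sigma=0.2, rng=None):
    """Artificial closed dynamical system s_t = tanh(A s_{t-m}) + eps_t (eq. 8.37)."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.normal(0.0, 1.0, size=(dim_s, dim_s))   # randomly initialized
    s = [rng.normal(size=dim_s) for _ in range(m)]  # m start states (assumed)
    for t in range(m, n_obs + m):
        eps = rng.normal(0.0, sigma, size=dim_s)    # white noise, sigma = 0.2
        s.append(np.tanh(A @ s[t - m]) + eps)
    return np.stack(s[m:])
```

A run with `generate_series(1000)` produces the 1000 observations used per Monte Carlo simulation, to be split 25%/75% into training and test data.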

Figure 8.16 Effects of different degrees of sparseness on matrix Anew (generalization error versus percentage of random initialization).

The results of the experiment (fig. 8.16) confirm our conjecture about an optimal sparseness level: if we randomly initialize 35% of the weights of matrix Anew (setting the rest to zero), we observe the best performance (i.e., the lowest average error on the test set). This is in line with equation 8.36, which yields an optimal sparseness level of 33%.


The second series of experiments concerned the optimal internal state dimension. During these experiments we kept the sparseness level constant at 33% (see eq. 8.36), whereas the dimension of the internal state was variable. We performed 100 runs for each dimension of the internal state with different random initializations of matrix Anew. Again we used the network error as an indicator of the model performance. The results are shown in fig. 8.17.

Figure 8.17 Impacts of different internal state dimensions dim(snew) (generalization error versus dimension of internal state).

It turns out that the best performance is achieved if the dimension of the internal state is equal to dim(snew) = 18. Our conjecture of dim(snew) = 3 · 5 = 15 (eq. 8.35) slightly underestimates the empirically measured optimal dimensionality. However, because of the noise term εt, we suppose that the optimal dimension of the system is larger than 5. This indicates that our conjecture in eq. 8.35 is a helpful estimate of the optimal dimensionality. Both experiments show that mismatches between dimensionality and sparseness cause problems in the function (superposition and conservation) of the transition matrix. In other words, an unbalanced parameterization of the inflation leads to lower generalization performance of the network.

8.5.3 Sparseness as a Necessary Condition for Large Systems

One might come up with the idea of initializing a model with a fully connected transition matrix A, and then pruning it during the learning process until a desired degree of sparseness is reached. This approach is misleading, as sparseness is an essential condition for the performance of the backpropagation algorithm in large networks.


Figure 8.18 Forward and backward information flow in the backpropagation algorithm.


Figure 8.18 shows the forward and backward information flow in the backpropagation algorithm.³ When we look at the calculations, it becomes obvious that if the transition matrix A is fully connected and dim(s) is increasing, the matrix-times-vector operations involve both more and longer sums. Due to the law of large numbers, the probability of large sum values also increases. This does not pose any problem in the forward flow of the algorithm: the hyperbolic tangent as the nonlinear activation function guarantees that the calculated values stay numerically tractable. In contrast, the backward information flow is linear. In this part of the algorithm large values are spread all over a fully connected matrix A. They quickly sum up to values which cause numerical instabilities and may destroy the whole learning process. This can be avoided if we use a sparse transition matrix A. The number of summands is then smaller, and therefore the probability of large sums is low.

In the remainder of this section we discuss the question of how to choose a sparse transition matrix A that is still trainable to a stable model. One intuitive answer is to initialize the model several times and then compare the different results. We performed 100 test runs of the neural network using different random initializations of matrix A. The prestructuring of the network followed our conjectures about inflation (eqs. 8.35 and 8.36). An obvious approach to overcome the uncertainty of the random initialization is to pick the best-performing network out of the 100 test runs. However, it is


not clear whether 100 test runs are really required to find an appropriate model. To study how many test runs are needed to find a good solution with a minimum of computational effort, we picked all possible subsets of k = 1, 2, . . . , 100 solutions out of the 100 test runs. From each subset we chose the solution with the lowest error on the test set and computed the average performance:

Ek = (1 / (100 choose k)) · Σ_{i1 < ... < ik} min(ei1, ei2, . . . , eik).   (8.38)

The resulting average error curve for k = 1, 2, . . . , 100 is depicted in ﬁg. 8.19 (solid line).
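Rather than enumerating all (100 choose k) subsets, Ek can be computed exactly from the sorted errors via order statistics (our own reformulation of eq. 8.38, not taken from the chapter): the i-th smallest error is the minimum of a random k-subset with probability C(n−1−i, k−1)/C(n, k).

```python
import numpy as np
from math import comb

def expected_best_of_k(errors, k):
    """E_k of eq. 8.38: expected minimum error when the best model out of a
    uniformly random k-subset of the n test runs is picked (exact, without
    enumerating all C(n, k) subsets)."""
    e = np.sort(np.asarray(errors, dtype=float))
    n = len(e)
    # P(the i-th smallest error is the subset minimum): the other k-1 subset
    # members must come from the n-1-i larger errors.
    probs = np.array([comb(n - 1 - i, k - 1) for i in range(n)]) / comb(n, k)
    return float(e @ probs)
```

For example, with errors [1, 2, 3, 4] and k = 2, the six subsets have minima 1, 1, 1, 2, 2, 3, so E₂ = 10/6 ≈ 1.67.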

Figure 8.19 Estimating the number of random initializations for matrix A (average error on the test set, solid line; coverage percentage c(k), dashed line; versus number of test runs).

As can be seen from the error curve in fig. 8.19, an appropriate solution can be obtained on average by choosing the best model out of a subset of 10 networks (vertical dotted line). Of course, the performance is worse than picking the best model out of 100 solutions. However, the additional computational effort does not justify the small improvement in performance. As an apparently easier guideline to determine the number of required test runs, we choose the number k such that the so-called coverage percentage,

c(k) = 1 − (1 − 1/m)^k,   (8.39)

is close to 1. The idea behind the coverage percentage c(k) in eq. 8.39 is that the first inflated network covers c(1) = 1/m of the active elements in the internal transition matrix Anew. Given c(k), the next initialization covers a fraction 1/m of the still uncovered weights in the transition matrix, resulting in c(k+1) = c(k) + (1/m) · (1 − c(k)). The coverage percentage c(k) for the different numbers of initializations is also


reported in fig. 8.19 (dashed line). A number of k = 10 random initializations already leads to a coverage of c(10) ≈ 0.983.

To further reduce the computational effort, we developed a more sophisticated approach, a process we call pruning and re-creation of weights. As described in section 8.5.1, we initialize matrix A with a sparseness level of Random(1/m) (eq. 8.36). The idea is now to optimize the initial sparse structure by alternating weight pruning and re-creation. Using this method, matrix A is always sparse and the number of active weights stays constant. The network still gets the opportunity to replace active weights by initially inactive ones that it considers more important for the identification of the dynamics. For the first step, the weight pruning, we use a test criterion similar to optimal brain damage (OBD) (LeCun et al., 1990):

test_w(w ≠ 0) = (∂²E/∂w²) · w².   (8.40)

We prune a certain percentage (e.g., 5%) of the active weights with the lowest test values, as these weights w are assumed to be less important for the identification of the dynamics. To simplify our calculations we use

∂²E/∂w² ≈ (1/T) Σ_t g_t²,   (8.41)

with g_t := ∂E_t/∂w, as an approximation of the second derivative. Our simulations showed that this approximation holds at a 95% level. In the second step, the re-creation of inactive weights, we use the following test:

test_w(w = 0) ∼ |(1/T) Σ_t g_t|.   (8.42)

We reactivate the weights w with the highest test values. This implies that we recover weights whose average gradient information is large in absolute value and which are therefore considered important for the identification of the dynamics. Note that we always re-create the same number of weights we pruned in the first step, to keep the sparseness level of the transition matrix A constant. Our experiments showed that we can even prune and re-create weights simultaneously without losing modeling ability.
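One pruning/re-creation step as described above can be sketched as follows (our own illustration; the per-pattern gradients g_t are assumed to be available, the restart value 1e-3 is an arbitrary assumption, and in this sketch a just-pruned weight may in principle be re-created if its mean gradient is large):

```python
import numpy as np

def prune_and_recreate(A, grads, frac=0.05):
    """One pruning / re-creation step for the sparse transition matrix A.

    grads : (T, d, d) per-pattern gradients g_t = dE_t/dA over the training set.
    Active weights with the lowest OBD-style test (eqs. 8.40, 8.41) are pruned;
    the same number of inactive weights with the largest mean gradient
    magnitude (eq. 8.42) is re-created as small random restart values.
    """
    active = A != 0
    curvature = np.mean(grads ** 2, axis=0)          # eq. 8.41 approximation
    test_active = curvature * A ** 2                 # eq. 8.40
    n_prune = max(1, int(frac * active.sum()))

    # prune: zero the n_prune active weights with the lowest test values
    idx = np.argsort(np.where(active, test_active, np.inf), axis=None)[:n_prune]
    A = A.copy()
    A[np.unravel_index(idx, A.shape)] = 0.0

    # re-create: activate the n_prune zero weights with the highest |mean gradient|
    test_inactive = np.abs(np.mean(grads, axis=0))   # eq. 8.42
    idx = np.argsort(np.where(A == 0, test_inactive, -np.inf), axis=None)[-n_prune:]
    A[np.unravel_index(idx, A.shape)] = 1e-3
    return A
```

Since exactly as many weights are re-created as were pruned, the sparseness level of A stays constant across steps.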

8.5.4 Information Flow in Sparse Recurrent Networks

In small networks with a full transition matrix A the information of a state neuron can reach every other one within one time step. This is diﬀerent in (large) sparse networks, where state neurons have a longer path length on average. As matrix A is sparse there is in most cases no direct connection between diﬀerent state neurons. Hence, it can take several state transitions to transport information from one state neuron to another. As information might not reach a

desired neuron in a limited number of time steps, this can be disadvantageous for the modeling ability of the network. The resulting question is how we can speed up the information flow, i.e., shorten the path length. In a simple recurrent network (e.g., eq. 8.9) the transition matrix A is applied once in every state transition. The idea is now to reduce the average path length with at least one additional undershooting step (Zimmermann and Neuneier, 2001). Undershooting means that we implement intermediate states s_{τ±1/2} which improve the computation of the network (fig. 8.20). These intermediate states have no external inputs. Like future unfolding time steps, they are only responsible for the development of the dynamics and therefore also improve numerical stability.

Figure 8.20 Undershooting improves the computation of a sparse matrix A and the numerical stability of the model.

The following formula gives a rough approximation of how many applications k of the transition matrix per time step are needed (eq. 8.43). As the state transition matrix A in the inflated network has, per definition (eq. 8.36), a sparseness of 1/m, in each time step every state neuron only gets the information of approximately a fraction 1/m of the others. The equation now determines the number of undershooting steps which are needed to achieve a desired average path length (information flow) between all state neurons. The kth power of the product of the sparseness factor 1/m and the dimension of the state vector dim(s) must be higher than the number of state neurons (= dim(s)):

((1/m) · dim(s))^k ≥ dim(s)
⇒ k ≥ 1 / (1 − log(m)/log(dim(s)))   (8.43)
⇒ k ≥ 1 + log(m)/log(dim(s)) (approximately).
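Numerically, the bound of eq. 8.43 is easy to evaluate (a small helper of our own):

```python
from math import log, ceil

def undershooting_steps(m, dim_s):
    """Smallest integer number of transition-matrix applications per time step
    satisfying the information-flow bound of eq. 8.43 for a 1/m-sparse
    inflated network."""
    assert dim_s > m, "the bound requires log(dim(s)) > log(m)"
    k_exact = 1.0 / (1.0 - log(m) / log(dim_s))
    return ceil(k_exact)
```

For the inflated network of section 8.5.2 (m = 3, dim(s) = 15) this gives k = 2, i.e., one additional undershooting state between successive time steps.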

Let us reconsider the equations of the DCNN with partially known observables


(eq. 8.23):

sτ = CE tanh(A · C · sτ−1 + c) + [0 0 Id]^T · yEτ,
yτ = [Id 0 0] sτ,
Σ_{t,τ} (yτ − ydτ)² → min over A, c.   (8.44)

Following the principle of undershooting, we add a state s_{τ−1/2} between the states sτ−1 and sτ (fig. 8.21). Consequently, the matrix A is now applied twice between two consecutive time steps, which implies that the information flow is doubled. The consistency matrix CE handles the lack of external inputs, such that the network stays dynamically consistent.
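One such doubled transition can be sketched as follows (a simplified illustration of fig. 8.21 under our own naming; the dynamically consistent external part [0 0 Id]^T · yE is passed in as a precomputed vector `ext` and, like in the chapter, the intermediate state receives no external inputs):

```python
import numpy as np

def dcnn_step_undershooting(s_prev, A, C, CE, c, ext=None):
    """One DCNN transition with a single undershooting step: the transition
    matrix A is applied twice between consecutive time steps."""
    s_half = CE @ np.tanh(A @ (C @ s_prev) + c)  # intermediate state, no inputs
    s_next = CE @ np.tanh(A @ (C @ s_half) + c)
    if ext is not None:                          # external part at full steps only
        s_next = s_next + ext
    return s_next
```

With identity matrices and zero bias this reduces to s_next = tanh(tanh(s_prev)), making the double application of the nonlinear transition explicit.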

Figure 8.21 Undershooting doubles the information flow between two successive states by applying the transition matrix twice.

It is important to note that the solution is diﬀerent from just decreasing the sparseness of the matrix A. The latter would not only cause numerical problems in the backpropagation algorithm but also disturb the balance between memory and computation.

8.6 Conclusion

In this chapter we focused on dynamical consistent neural networks (DCNNs) for the modeling of open dynamical systems. After a short description of small recurrent networks, including error correction neural networks, we presented a new kind of dynamical consistent neural network. These networks allow an integrated view of particular modeling problems and consequently show better generalization abilities. We concentrated the modeling of the dynamics in one single transition matrix and also enhanced the model from a simple statistical to a dynamically consistent handling of missing input information in the future part. The networks are now able to map


integrated system dynamics (e.g., financial markets) instead of only a small set of time series. The final DCNN combines the advantages of the former RNN and the ECNN.

Besides the new, more powerful architectures, the modeling involves a paradigm shift in the analysis of open systems (see fig. 8.1). In the beginning we looked at the description of a dynamical system from an exterior point of view. This means that we observed the information flow into and out of an open system and tried to reconstruct the interior. The long-term predictability of the model finally depended on the quality of the extracted autonomous subsystem. In our new approach we describe dynamical systems from an interior viewpoint. Conceptually we start with a world model (fig. 8.22). Without loss of generality, we assume that our variables of interest are all organized as the first elements in a large state vector. Identifying this first section of the state vector as our observables (yt), we can reconstruct some more unobservable states (ht) by their indirect influence on the observables. Nevertheless, there is an infinite number of variables which are unobservable and even unidentifiable. Their nearly infinite influence can be shrunk to a finite-dimensional section in the state vector: the error correction part (et). If the error correction is equal to zero, knowledge about the unidentifiable variables is not necessary. Otherwise, it spares us from having to know the details of the unknown part of the world. Clearly, the concept of a world model is a closed one from the beginning. As a consequence of dynamical consistency, the closure concept even holds for finite-dimensional subsections of it. Therefore it models the dynamics as a closed system and is still able to keep the model evolution exactly on the observed state trajectory (see DCNN2, eq. 8.22).

Figure 8.22 Variable space of the world model. yt stands for the observables, ht for the hidden variables which can be explained by the observables, and et for the error corrections, which close the gap between the observable and the unobservable part of the system.


We augmented the model-building process by incorporating prior knowledge. Learning from data is only one part of this process. The recurrent ECNN and the DCNN are two examples of this model-building philosophy. Remarkably, such a joint model-building framework not only provides superior forecasts, but also a deeper understanding of the underlying dynamical system. On this basis it is also possible to analyze and to quantify the uncertainty of the predictions. This is especially important for the development of decision support systems. We are currently testing our models in several industrial applications. Further research is being conducted on the optimal sparseness of the transition matrix A as well as the optimal initialization method for the state vector.

Acknowledgment

We thank J. Zwierz for the calculations and tests during our experiments in section 8.5. The extensive work performed by M. Pellegrino in proofreading this chapter is gratefully acknowledged. The computations were performed on our neural network modeling software SENN (Simulation Environment for Neural Networks), which is a product of Siemens AG.

Notes

1. For other cost functions see Neuneier and Zimmermann (1998).
2. For an overview of algorithmic methods see Pearlmutter (2001) and Medsker and Jain (1999).
3. For further details, the reader is referred to Haykin (1994) and Bishop (1995).

9

Diversity in Communication: From Source Coding to Wireless Networks

Suhas Diggavi

Randomness is an inherent part of network communications. We broadly define diversity as creating multiple independent instantiations (conduits) of randomness for conveying information. In the past few years a trend has emerged in several areas of communications in which diversity is utilized for reliable transmission and efficiency. In this chapter, we give examples from three topics where diversity is beginning to play an important role.

9.1 Introduction

One of the main characteristics of network communication is uncertainty (randomness): randomness in users' wireless transmission channels, randomness in users' geographical locations in a wireless network, and randomness in route failures and packet losses in networks. The randomness we study in this chapter can have timescales of variation that are comparable to the communication transmission times. This can result in complete failures in communication and therefore affect reliability. Such "nonergodic" losses can be combated if we somehow create independent instantiations of the randomness. We broadly define diversity as the method of conveying information through such multiple independent instantiations. The overarching theme of this chapter is how to create diversity and how we can use it as a tool to enhance performance. We study this idea through diversity in multiple antennas, multiple users, and multiple routes.

The functional modularities and abstractions of the network protocol stack, known as layering (Keshav, 1997), contributed significantly to the success of the wired Internet infrastructure. Layering achieves a form of information hiding, providing only interface information to higher layers, and not the details of the implementation. The physical layer is dedicated to signal transmission, while the data-link layer implements functionalities of data framing, arbitrating access to the


transmission medium, and some error control. The network layer abstracts the physical and data-link layers from the upper layers by providing an interface for end-to-end links. Hence, the task of routing and the framing details of the link layer are hidden from the higher layers (transport and application layers). However, as we will see, the use of diversity necessarily causes cross-layer interactions. These cross-layer interactions form a subtext to the theme of this chapter.

Wireless communication hinges on transmitting information riding on radio (electromagnetic) waves, and hence the information undergoes the attenuation effects (fading) of radio waves (see section 9.2 for more details). Such multipath fading is a source of randomness. Here diversity arises by utilizing independent realizations of fading in several domains: time (mobility), frequency (delay spread), and space (multiple antennas). Over the past decade research results have shown that multiple-antenna spatial diversity (space-time) communication can not only provide robustness, but also dramatically improve reliable data rates. These ideas are having a huge impact on the design of physical layer transmission techniques in next-generation wireless systems. Multiple-antenna diversity is the focus of section 9.3.

The wireless communication medium is naturally shared by several users using the same resources. Since the users' locations (and therefore their transmission conditions) are roughly independent, they experience independent randomness in local channel and interference conditions. Diversity in this case arises by utilizing the independent transmission conditions of the different users as conduits for transmitting information, i.e., multi-user diversity. This can be utilized in two ways. One is by allowing users access to resources when it is most advantageous to the overall network; this is a form of opportunistic scheduling and is examined in section 9.4.1. The other is by using the users themselves as relays to transmit information from source to destination; this is a form of opportunistic relaying and is studied in section 9.4.2. These multi-user diversity methods are the focus of section 9.4.

In transmission over networks, random route failures and packet losses degrade performance. Diversity here would be achieved by creating conduits with independent probabilities of route failure. For example, this can be done by transmission over multiple routes with no overlapping links. A fundamental question that arises is how we can best utilize the presence of such route diversity. In order to utilize these conduits, multiple description source coding generates multiple codeword streams to describe a source (such as images, voice, video, etc.). The design goal is to have a graceful degradation in performance (in terms of distortion) when only subsets of the transmitted streams are received. In section 9.5 we study fundamental bounds and design ideas for multiple description source coding.

Therefore, diversity not only plays a role in robustness; it can also result in remarkable gains in achievable performance over several disparate applications. The details of how diversity enhances performance are discussed in the sequel.

Figure 9.1 Radio propagation environment.

9.2 Transmission Models

Since a considerable part of this chapter is about wireless communication, it is essential to understand some of the rudiments of wireless channel characteristics. In this section, we focus on models for point-to-point wireless channels and also introduce some of the basic characteristics of transmission over (wireless) networks. Wireless communication transmits information by riding (modulation) on electromagnetic (radio) waves with a carrier frequency varying from a few hundred megahertz to several gigahertz. Therefore, the behavior of the wireless channel is a function of the radio propagation effects of the environment. A typical outdoor wireless propagation environment is illustrated in fig. 9.1, where the mobile wireless node is communicating with a wireless access point (base station). The signal transmitted from the mobile may reach the access point directly (line-of-sight) or through multiple reflections on local scatterers (buildings, mountains, etc.). As a result, the received signal is affected by multiple random attenuations and delays. Moreover, the mobility of either the nodes or the scattering environment may cause these random fluctuations to vary with time. Time variation results in the random waxing and waning of the transmitted signal strength over time. Finally, a shared wireless environment may incur interference (due to concurrent transmissions from other mobile nodes) to the transmitted signal. The attenuation incurred by wireless propagation can be decomposed into three main factors: attenuation due to the distance between communicating nodes (path loss), attenuation due to absorption in local structures such as buildings (shadowing loss), and rapid signal fluctuations due to constructive and destructive interference of multiple reflected radio wave paths (fading loss). Typically the path loss attenuation behaves as 1/d^α as a function of distance d, with α ∈ [2, 6].
More detailed models of wireless channels can be found in Jakes (1974) and Rappaport (1996).
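As a small numerical illustration of the 1/d^α path loss law (our sketch, not from the text; the exponent, reference distance, and reference loss are assumed values):

```python
import numpy as np

def path_loss_db(d, alpha=3.5, d0=1.0, pl0_db=0.0):
    """Path loss in dB at distance d: received power decays as 1/d**alpha,
    referenced to a loss of pl0_db at distance d0 (illustrative values)."""
    return pl0_db + 10.0 * alpha * np.log10(d / d0)

# Doubling the distance costs 10*alpha*log10(2) dB, i.e., about 3*alpha dB.
extra = path_loss_db(200.0) - path_loss_db(100.0)   # ~10.5 dB for alpha = 3.5
```

Shadowing and fading losses would be modeled as random terms on top of this deterministic trend.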

Figure 9.2 MIMO channel model.

9.2.1 Point-to-Point Model

For the purposes of this chapter we start with the following model:

y_c(t) = ∫ h_c(t; τ) s(t − τ) dτ + z(t) ,    (9.1)

where the transmitted signal s(t) = g(t) ∗ x(t) is the convolution of the information-bearing signal x(t) with g(t), the transmission shaping filter, y_c(t) is the continuous-time received signal, h_c(t; τ) is the response at time t of the time-varying channel if an impulse is sent at time t − τ, and z(t) is the additive Gaussian noise. The channel impulse response (CIR) depends on the combination of all three propagation effects and in addition contains the delay induced by the reflections. To collect discrete-time sufficient statistics of the information signal x(t) we need to sample (9.1) faster than the Nyquist rate. Therefore we focus on the following discrete-time model:

y(k) = y_c(kT_s) = Σ_{l=0}^{ν} h(k; l) x(k − l) + z(k) ,    (9.2)

where y(k), x(k), and z(k) are the output, input, and noise samples at sampling instant k, respectively, and h(k; l) represents the sampled time-varying channel impulse response of finite length ν. Modeling the channel as having a finite duration can be made arbitrarily accurate by appropriately choosing the channel memory ν. Though the channel response {h(k; l)} depends on all three radio propagation attenuation factors, in the timescales of interest the main variations come from the small-scale fading, which is well modeled as a complex Gaussian random process. Since we are interested in studying multiple-antenna diversity, we need to extend the model given in equation 9.2 to the multiple transmit (M_t) and receive (M_r) antenna case. The multi-input multi-output (MIMO) model is given by

y(k) = Σ_{l=0}^{ν} H(k; l) x(k − l) + z(k) ,    (9.3)
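The discrete-time model of equation 9.2 can be simulated directly as a tapped-delay line. The following sketch is our illustration (the tap count, input alphabet, and noise level are arbitrary choices), with a single time-invariant channel draw for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

nu, n = 3, 1000     # channel memory and number of symbols
# One realization of a (nu+1)-tap channel; small-scale fading is modeled
# as complex Gaussian, so each tap is drawn CN(0, 1).
h = (rng.standard_normal(nu + 1) + 1j * rng.standard_normal(nu + 1)) / np.sqrt(2)
x = rng.choice([-1.0, 1.0], size=n) + 0j          # BPSK information symbols
z = 0.1 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

# y(k) = sum_{l=0}^{nu} h(l) x(k-l) + z(k): equation 9.2 with h(k; l) = h(l)
y = np.convolve(x, h)[:n] + z
```

The MIMO model of equation 9.3 replaces each scalar tap h(l) with an M_r × M_t matrix.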

Figure 9.3 Block time-invariant model.

where the M_r × M_t complex matrix H(k; l) represents the lth tap of the channel matrix response with x ∈ C^{M_t} as the input and y ∈ C^{M_r} as the output (see fig. 9.2). The variations of the channel response between antennas arise due to variations in arrival directions of the reflected radio waves (Raleigh et al., 1994). The input vector may have independent entries to achieve high throughput (e.g., through spatial multiplexing) or correlated entries through coding or filtering to achieve high reliability (better distance properties, higher diversity, spectral shaping, or a desirable spatial profile; see section 9.3). Throughout this chapter, the input is assumed to be zero mean and to satisfy an average power constraint, i.e., E[||x(k)||²] ≤ P. The vector z ∈ C^{M_r} models the effects of noise, is assumed to be independent of the input, and is modeled as a complex additive circularly symmetric Gaussian vector with z ∼ CN(0, R_z), i.e., a complex Gaussian vector with mean 0 and covariance R_z. In many cases we assume white noise, i.e., R_z = σ²I. Finally, the basic point-to-point model given in equation 9.3 can be modified for an important special case. Many of the insights can be gained from the flat-fading channel, where we have ν = 0 in equation 9.3. Unless otherwise mentioned, we will use this special case for illustration throughout this chapter. We also examine the case where we transmit a block or frame of information. Here we encounter another important modeling assumption. If the transmission block is small enough so that the channel time variation within a transmission block can be neglected, we have a block time-invariant model. Such models are quite realistic for transmission blocks of lengths less than a millisecond and typical channel variation bandwidths. However, this does not imply that the channel remains constant during the entire transmission.
Transmission blocks sent at various periods of time can experience different (independent) channel instantiations (see fig. 9.3). This can be utilized by coding across these different channel instantiations, as will be seen in section 9.3. Therefore, if the transmission block is of length T, for the flat-fading case, the specialization of equation 9.3 yields

Y^(b) = H^(b) X^(b) + Z^(b) ,    (9.4)

where Y^(b) = [y^(b)(0), ..., y^(b)(T − 1)] ∈ C^{M_r × T} is the received sequence, H^(b) ∈ C^{M_r × M_t} is the block time-invariant channel fading matrix for transmission block b, X^(b) = [x^(b)(0), ..., x^(b)(T − 1)] ∈ C^{M_t × T} is the "space-time" information transmission sequence, and Z^(b) = [z^(b)(0), ..., z^(b)(T − 1)] ∈ C^{M_r × T}.
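The block time-invariant model in equation 9.4 is straightforward to simulate. The following sketch is our illustration (antenna counts, block length, and noise level are arbitrary); an independent channel realization is drawn for each block:

```python
import numpy as np

rng = np.random.default_rng(1)
Mt, Mr, T, B = 2, 2, 10, 5      # antennas, block length, number of blocks
sigma2 = 0.01                   # noise variance per receive dimension

blocks = []
for b in range(B):
    # Independent CN(0,1) channel realization for each transmission block
    H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
    X = rng.choice([-1.0, 1.0], size=(Mt, T)) + 0j      # space-time input frame
    Z = np.sqrt(sigma2 / 2) * (rng.standard_normal((Mr, T))
                               + 1j * rng.standard_normal((Mr, T)))
    Y = H @ X + Z                                       # equation 9.4
    blocks.append((H, X, Y))
```

Coding across the B independent realizations is precisely what the ergodic capacity results of section 9.3 exploit.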

Figure 9.4 General multi-user wireless communication network.

9.2.2 Network Models

The wireless medium is inherently shared, and this directly motivates a study of multi-user communication techniques. Moreover, since we are also interested in multi-user diversity, we need to extend our model from the point-to-point scenario (eq. 9.2) to the network case. The general communication network (illustrated in fig. 9.4) consists of n nodes trying to communicate with each other. In the scalar flat-fading wireless channel, the received symbol Y_i(t) at the ith node is given by

Y_i(t) = Σ_{j=1, j≠i}^{n} h_{i,j} X_j(t) + Z_i(t) ,    (9.5)

where hi,j is determined by the channel attenuation between nodes i and j. Given this general model, one way of abstracting the multi-user communication problem is through embedding it in an underlying communication graph GC where the n nodes are vertices of the graph and the edges of the graph represent a channel connecting the two nodes along with the interference from other nodes. The graph could be directed with constraints and channel transition probability depending on the directed graph. A general multi-user network is therefore a fully connected graph with the received symbol at each node described as a conditional distribution dependent on the messages transmitted by all other nodes. Such a graph is illustrated in ﬁg. 9.5. We examine diﬀerent communication topologies in section 9.4 and study the role of diversity in networks.
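A direct simulation of equation 9.5 for a small network is sketched below (our illustration; the node count, gains, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4                                        # number of nodes
# Flat-fading gains h[i, j] from node j to node i (diagonal entries unused)
h = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
X = rng.choice([-1.0, 1.0], size=n) + 0j     # symbol transmitted by each node
Z = 0.1 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

# Received symbol at node i: superposition of all other nodes' transmissions
Y = np.array([sum(h[i, j] * X[j] for j in range(n) if j != i) + Z[i]
              for i in range(n)])
```

Each node thus hears every concurrent transmission; which of these terms is "signal" and which is "interference" depends on the communication topology.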

9.3 Multiple-Antenna Diversity

The first form of diversity that we examine in some detail is that of multiple-antenna diversity. A major development over the past decade has been the emergence of space-time (multiple-antenna) techniques that enable high-rate, reliable communication over fading wireless channels. In this section we highlight some of the theoretical underpinnings of this topic. More details about practical code constructions can be found in Tarokh et al. (1998), Diggavi et al. (2004b), and references therein.

Figure 9.5 Graph representation of communication topologies. On the left is a general topology and on the right is a hierarchical topology.

Reliable information transmission over fading channels has a long and rich history; see Ozarow et al. (1994) and references therein. The importance of multiple-antenna diversity was recognized early; see, for example, Brennan (1959). However, most of the focus until the mid-1990s was on receive diversity, where multiple "looks" of the transmitted signal were obtained using many receive antennas (see equation 9.3 with M_t = 1). The use of multiple transmit antennas was restricted to sending the same signal over each antenna, which is a form of repetition coding (Wornell and Trott, 1997). During the mid-1990s several researchers started to investigate the idea of coding across transmit antennas to obtain higher rate and reliability (Foschini, 1996; Tarokh et al., 1998; Telatar, 1999). One focus was on maximizing the reliable transmission rate, i.e., channel capacity, without requiring a bound on the rate at which error probability diminishes (Foschini, 1996; Telatar, 1999). However, another point of view was explored where nondegenerate correlation was introduced between the information streams across the multiple transmit antennas in order to guarantee a certain bound on the rate at which the error probability diminishes (Tarokh et al., 1998). These approaches have led to the broad area of space-time codes, which is still an active research topic. In section 9.3.1 we first start with an understanding of reliable transmission rate over multiple-antenna channels. In particular we examine the rate advantages of multiple transmit and receive antennas. Then in section 9.3.2 we introduce the notion of diversity order, which captures transmission reliability (error probability) in the high signal-to-noise ratio (SNR) regime. This allows us to develop criteria for space-time codes which guarantee a given reliability.
Section 9.3.3 examines the fundamental trade-off between maximizing rate and reliability.

9.3.1 Capacity of Multiple-Antenna Channels

The concept of capacity was ﬁrst introduced by Shannon (1948), where it was shown that even in noisy channels, one can transmit information at positive rates with the


error probability going to zero asymptotically in the coding block size. The seminal result was that for a noisy channel whose input at time k is {X_k} and output is {Y_k}, there exists a number C such that

C = lim_{T→∞} (1/T) sup_{p(x^T)} I(X^T ; Y^T) ,    (9.6)

where the mutual information is given by I(X^T; Y^T) = E_{X^T,Y^T}[log( p(x^T, y^T) / (p(x^T) p(y^T)) )], p(·) is the probability density function, and for convenience we have denoted X^T = {X_1, ..., X_T} and similarly for Y^T (Cover and Thomas, 1991). In Shannon (1948) it was shown that asymptotically in block length T, there exist codes which can transmit information at all rates below C with arbitrarily small probability of error over the noisy channel. Perhaps the most famous illustration of this idea was the formula derived in Shannon (1948) for the capacity C of the additive white Gaussian noise channel with noise variance σ² and input power constraint P:

C = (1/2) log(1 + P/σ²) .    (9.7)

In this section we will focus mostly on the flat-fading channels where, in equation 9.3, we have ν = 0. The generalizations of these ideas for frequency-selective channels (i.e., ν > 0) can be easily carried out (see Biglieri et al., 1998; Diggavi et al., 2004b, and references therein). We begin with the case where we are allowed to develop transmit schemes which code across multiple (B) realizations of the channel matrix {H^(b)}_{b=1}^B (see fig. 9.3). In such a case, we can again define a notion of reliable transmission rate, where the error probability decays to zero when we develop codes across asymptotically large numbers of transmit blocks (i.e., B → ∞). We examine this for a coherent receiver, where the receiver uses perfect channel state information {H^(b)} for each transmission block. But the transmitter is assumed not to have access to the channel realizations. To gain some intuition, consider first the case when each transmission block is large, i.e., T → ∞. If we have one transmit antenna (M_t = 1), the channel response is a vector h^(b) ∈ C^{M_r} (see equation 9.4 in section 9.2). Therefore the reliable transmission rate for any particular block can be generalized from (9.7) as log(1 + ||h^(b)||² P/σ²). Note

that when we are dealing with complex channels (as is usual in communication with in-phase and quadrature-phase transmissions), the factor of 1/2 disappears (Neeser and Massey, 1993) when we adapt the expression from equation 9.7. Now, if one codes across a large number of transmission blocks (B → ∞), for a stationary and ergodic sequence of {h(b) } we would expect to get a reliable transmission rate that is the average of this quantity. This intuition has been made precise in Ozarow et al. (1994), and references therein, for ﬂat-fading channels (ν = 0), even when we do not have T → ∞, but we have B → ∞. Therefore when we have only receive diversity, i.e., Mt = 1, for a given Mr , it is shown (Ozarow et al., 1994) that the

capacity is given by

C = E[ log(1 + ||h||² P/σ²) ] ,    (9.8)

where the expectation is taken over the fading channel {h^(b)} and the channel sequence is assumed to be stationary and ergodic. This is called the ergodic channel capacity (Ozarow et al., 1994). This is the rate at which information can be transmitted if there is no feedback of the channel state ({h^(b)}) from the receiver to the transmitter. If there is feedback available about the channel state, one can do slightly better through optimizing the allocation of transmitted power by "waterfilling" over the fading channel states. The problem of studying the capacity of channels with causal transmitter-side information was introduced in Shannon (1958a), where a coding theorem for this problem was proved. Using ideas from there and perfect transmitter channel state information, capacity expressions that generalize equation 9.8 have been developed (Goldsmith and Varaiya, 1997). However, for fast time-varying channels the instantaneous feedback could be difficult, resulting in an outdated estimate of the channel being sent back (Caire and Shamai, 1999; Viswanathan, 1999). The basic question of the impact of feedback on the capacity of time-varying channels is still not completely understood, and for developing the basic ideas in this chapter, we will deal with the case where the transmitter does not have access to the channel state information. We refer the interested reader to Biglieri et al. (1998) for a more complete overview of such topics. Now let us focus our attention on the multiple transmit and receive antenna channel where again, as before, we consider the coherent case, i.e., the receiver has perfect channel state information (CSI) H^(b). In the flat-fading case where ν = 0, when we code across B transmission blocks, the mutual information for this case is

(1/BT) I({X^(b)}_{b=1}^B ; {Y^(b)}_{b=1}^B, {H^(b)}_{b=1}^B) ,

since we assume that the receiver has access to CSI.
Using the chain rule of mutual information (Cover and Thomas, 1991), this can be written as

R(B) = (1/BT) [ I({X^(b)}_{b=1}^B ; {H^(b)}_{b=1}^B) + I({X^(b)}_{b=1}^B ; {Y^(b)}_{b=1}^B | {H^(b)}_{b=1}^B) ] .    (9.9)

Using the assumption that the input {x(k)} is independent of the fading process (as the transmitter does not have CSI), equation 9.9 is equal to

R(B) = (1/BT) E_H[ I({X^(b)}_{b=1}^B ; {Y^(b)}_{b=1}^B | {H^(b)}_{b=1}^B) ] .    (9.10)

Now, if we use the memoryless property of the vector Gaussian channel obtained by conditioning on H^(b), and also the assumption that {H^(b)} is i.i.d. over b, then as B → ∞ we get

lim_{B→∞} (1/BT) I({X^(b)}_{b=1}^B ; {Y^(b)}_{b=1}^B, {H^(b)}_{b=1}^B) = E_H[ log( |R_z + H R_x H*| / |R_z| ) ] ,    (9.11)


where the expectation is taken over the random channel realizations {H^(b)}. An operational meaning to this expression can be given by showing that there exist codes which can transmit information at this rate with arbitrarily small probability of error (Telatar, 1999). In general, it is difficult to evaluate equation 9.11 except for some special cases. If the random matrix H^(b) consists of zero-mean i.i.d. Gaussian elements, Telatar (1999) showed that

C = E_H[ log |I + (P/(M_t σ²)) H H*| ]    (9.12)

is the capacity of the fading matrix channel. Therefore in this case, to achieve capacity the optimal codebook is generated from an i.i.d. Gaussian input {x^(b)} with R_x = E[xx*] = (P/M_t) I. The expression in equation 9.12 shows that the capacity is dependent on the eigenvalue distribution of the random matrix H with Gaussian i.i.d. components. This important connection between capacity of multiple-antenna channels and the mathematics related to eigenvalues of random matrices (Edelman, 1989) was noticed in Telatar (1999), where it was shown that the capacity could be numerically computed using Laguerre polynomials (Edelman, 1989; Muirhead, 1982; Telatar, 1999).

Theorem 9.1 (Telatar, 1999) The capacity C of the channel with M_t transmitters and M_r receivers and average power constraint P is given by

C = Σ_{k=0}^{T_min − 1} ( k! / (k + T_max − T_min)! ) ∫_0^∞ log(1 + Pλ/(σ² M_t)) λ^{T_max − T_min} [L_k^{T_max − T_min}(λ)]² e^{−λ} dλ ,

where T_max = max(M_t, M_r), T_min = min(M_t, M_r), and L_k^m(·) is the generalized Laguerre polynomial of order k with parameter m (Gradshteyn and Ryzhik, 1994).

In Foschini (1996) it was observed that when M_t = M_r = M the capacity C grows linearly in M as M → ∞.

Theorem 9.2 (Foschini, 1996) For M_t = M_r = M, the capacity C given by (9.12) grows asymptotically linearly in M, i.e.,

lim_{M→∞} C/M = c*(SNR) ,    (9.13)

where c*(SNR) is a constant depending on SNR. This quantifies the advantage of using multiple transmit and receive antennas and shows the promise of such architectures for high-rate reliable wireless communication.
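The expectation in equation 9.12 and the linear growth of theorem 9.2 are easy to check numerically. The sketch below is our illustration (the SNR, trial count, and antenna sizes are arbitrary choices); it estimates the capacity in nats by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(3)

def ergodic_capacity(M, snr, trials=2000):
    """Monte Carlo estimate of E[log|I + (P/(Mt*sigma^2)) H H*|] (eq. 9.12)
    for Mt = Mr = M with i.i.d. CN(0,1) entries of H; result in nats."""
    total = 0.0
    for _ in range(trials):
        H = (rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))) / np.sqrt(2)
        G = np.eye(M) + (snr / M) * (H @ H.conj().T)
        total += np.log(np.linalg.det(G).real)
    return total / trials

snr = 10.0                                    # P / sigma^2
caps = [ergodic_capacity(M, snr) for M in (1, 2, 4)]
# caps grows roughly linearly in M, consistent with theorem 9.2.
```

Theorem 9.1 gives the same quantity in closed form via Laguerre polynomials; the Monte Carlo estimate is a quick sanity check.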


To achieve the capacity given in equation 9.12, we require joint optimal (maximum-likelihood) decoding of all the receiver elements, which could have large computational complexity. The channel model in equation 9.3 resembles a multiuser channel (Verdu, 1998) with user cooperation. A natural question to ask is whether the simpler decoding schemes proposed in multi-user detection would yield good performance on this channel. A motivation for this is seen by observing that for i.i.d. elements of the channel response matrix (flat-fading) the normalized cross-correlation matrix decouples (i.e., lim_{M_r→∞} (1/M_r) H*H → I_{M_t}). Therefore, since nature provides some decoupling, a simple "matched filter" receiver (Verdu, 1998) might perform quite well. In this context a matched filter for the flat-fading channel in equation 9.3 is given by ỹ(k) = H*(k)y(k). Therefore, component-wise this means that

ỹ_i(k) = ||h_i(k)||² x_i(k) + Σ_{j=1, j≠i}^{M_t} h_i*(k) h_j(k) x_j(k) + z̃_i(k) ,    i = 1, ..., M_t .    (9.14)

By ignoring the cross-coupling between the channels, we decode x̂_i by including the "interference" from {x_j}_{j≠i} as part of the noise. However, a tension arises between the decoupling of the channels and the added "interference" Σ_{j=1, j≠i}^{M_t} h_i*(k) h_j(k) x_j(k)

from the other antennas, which clearly grows with the number of antennas. It is shown in Diggavi (2001) that the two effects exactly cancel each other.

Proposition 9.1 If H(k) = [h_1(k), ..., h_{M_t}(k)] ∈ C^{M_r × M_t} and h_l(k) ∼ CN(0, I_{M_r}), l = 1, ..., M_t, are i.i.d., then

lim_{M_r→∞, M_t=αM_r} Σ_{j=1, j≠i}^{M_t} | h_i*(k) h_j(k) / M_r |² = α   almost surely.
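Proposition 9.1 can be checked by direct simulation. In the sketch below (our illustration; α and the array sizes are arbitrary), the normalized interference power seen by one stream concentrates near α as the arrays grow:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.5

for Mr in (50, 200, 800):
    Mt = int(alpha * Mr)
    H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
    h0 = H[:, 0]
    # Sum over j != 0 of |h_0^* h_j / Mr|^2, which should approach alpha
    s = sum(abs(h0.conj() @ H[:, j] / Mr) ** 2 for j in range(1, Mt))
    print(Mr, round(float(s), 3))
```

The interference terms grow in number but shrink individually, and the two effects balance exactly as the proposition states.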

Therefore, using this result it can be shown that the simple detector still retains the linear growth rate of the optimal decoding scheme (Diggavi, 2001). However, in the rate R_I achievable for this simple decoding scheme, we do pay a price in terms of rate growth with SNR.

Theorem 9.3 If H_{i,j} ∼ CN(0, 1), with i.i.d. elements, then

lim_{M_t→∞, M_t=αM_r} (1/M_t) I(Y, H; X) ≥ lim_{M_t→∞, M_t=αM_r} R_I/M_t = log( 1 + (P/σ²) / (α (1 + P/σ²)) ) .

Multi-user detection (Verdu, 1998) is a good analogy to understand receiver structures in MIMO systems. The main diﬀerence is that unlike multiple access channels, the space-time encoder allows for cooperation between “users.” Therefore, the encoder could introduce correlations that can simplify the job of the decoder.


Such encoding structures using space-time block codes are discussed further in Diggavi et al. (2004b), and references therein. An example of using the multi-user detection approach is the result in theorem 9.3, where a simple matched filter receiver is applied. Using more sophisticated linear detectors, such as the decorrelating receiver and the MMSE receiver (Verdu, 1998), one can improve performance while still maintaining the linear growth rate. The decision feedback structures, also known as successive interference cancellation or onion peeling (Cover, 1975; Patel and Holtzman, 1994; Wyner, 1974), can be shown to be optimal, i.e., to achieve the capacity, when MMSE multi-user interference suppression is employed and the layers are peeled off (Cioffi et al., 1995; Varanasi and Guess, 1997). However, decision feedback structures inherently suffer from error propagation (which is not taken into account in the theoretical results) and could therefore have poor performance in practice, especially at low SNR. Thus, examining nondecision feedback structures is important in practice. All of the above results illustrate that significant gains in information rate (capacity) are possible using multiple transmit and receive antennas. The intuition for the gains with multiple transmit and receive antennas is that there is a larger number of communication modes over which the information can be transmitted. This is formalized by the observation (Diggavi, 2001; Zheng and Tse, 2002) that the capacity as a function of SNR, C(SNR), grows linearly in min(M_r, M_t), even for a finite number of antennas, asymptotically in the SNR.

Theorem 9.4

lim_{SNR→∞} C(SNR) / log(SNR) = min(M_r, M_t) .    (9.15)
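Theorem 9.4 can be visualized with the same Monte Carlo machinery: the ratio C(SNR)/log(SNR) approaches min(M_r, M_t) as the SNR grows. The sketch below is ours (the antenna counts and SNR grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
Mt, Mr = 3, 2                     # min(Mr, Mt) = 2

def capacity_nats(snr, trials=500):
    """Monte Carlo estimate of E[log|I + (snr/Mt) H H*|] in nats."""
    total = 0.0
    for _ in range(trials):
        H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
        G = np.eye(Mr) + (snr / Mt) * (H @ H.conj().T)
        total += np.log(np.linalg.det(G).real)
    return total / trials

for snr_db in (10, 30, 50):
    snr = 10.0 ** (snr_db / 10)
    print(snr_db, capacity_nats(snr) / np.log(snr))   # tends toward min(Mr, Mt)
```

The limit min(M_r, M_t) is often called the multiplexing gain of the channel.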

In the results above, the fundamental assumption was that the receiver had access to perfect channel state information, obtained through training or other methods. When the channel is slowly varying, the estimation error could be small since we can track the channel variations, and one can quantify the effect of such estimation errors. As a rule of thumb, it is shown by Lapidoth and Shamai (2002) that if the estimation error is small compared to 1/SNR, these results would hold. Another line of work assumes that the receiver does not have any channel state information. The question of the information rate that can be reliably transmitted over the multiple-antenna channel without channel state information was introduced in Hochwald and Marzetta (1999) and has also been examined in Zheng and Tse (2002). The main result from this line of work shows that the capacity growth is again (almost) linear in the number of transmit and receive antennas, as stated formally next.


Theorem 9.5 If the channel is block fading with block length T and we denote K = min(M_t, M_r), then for T > K + M_t, as SNR → ∞, the capacity is

C(SNR) = K (1 − K/T) log(SNR) + c + o(1) ,

where c is a constant depending only on M_r, M_t, T.


In fact, Zheng and Tse (2002) go on to show that the rate achievable by using a training-based technique is only a constant factor away from the optimal, i.e., it attains the same capacity-SNR slope as in theorem 9.5. Further results on this topic can be found in Hassibi and Marzetta (2002). Therefore, even in the noncoherent block-fading case, there are significant advantages in using multiple antennas. Most of the discussion above was for the flat-fading case where ν = 0 in equation 9.3. However, these ideas can be easily extended to the block time-invariant frequency-selective channels, where again the advantages of multiple-antenna channels can be established (Diggavi, 2001). However, when the channels are not block time-invariant, the characterization of the capacity of frequency-selective channels is an open question.

Outage

In all of the above results, the error probability goes to zero asymptotically in the number of coding blocks, i.e., B → ∞. Therefore, coding is assumed to take place across fading blocks, and hence it inherently uses the ergodicity of the channel variations. This approach would clearly entail large delays, and therefore Ozarow et al. (1994) introduced a notion of outage, where the coding is done (in the extreme case) just across one fading block, i.e., B = 1. Here the transmitter sees only one block of channel coefficients; therefore the channel is nonergodic, and the strict Shannon-sense capacity is zero. However, one can define an outage probability, which is the probability with which a certain rate R is possible. Therefore, for a block time-invariant channel with a single channel realization H^(b) = H the outage probability can be defined as follows.

Definition 9.1 The outage probability for a transmission rate of R and a given transmission strategy p(X) is defined as

P_outage(R, p(X)) = P[ H : I(X; Y | H^(b) = H) < R ] .    (9.16)

Therefore, if one uses a white Gaussian codebook (R_x = (P/M_t) I), then (abusing notation by dropping the dependence on p(X)) we can write the outage probability at rate R as

P_outage(R) = P[ log |I + (P/(M_t σ²)) H H*| < R ] .    (9.17)
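Equation 9.17 lends itself to a Monte Carlo estimate. The sketch below is our illustration (the rate, SNR values, and antenna counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def outage_prob(R, snr, Mt=2, Mr=2, trials=20000):
    """Monte Carlo estimate of eq. 9.17 with a white Gaussian codebook; R in nats."""
    count = 0
    for _ in range(trials):
        H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
        rate = np.log(np.linalg.det(np.eye(Mr) + (snr / Mt) * (H @ H.conj().T)).real)
        if rate < R:
            count += 1
    return count / trials

p_low_snr = outage_prob(R=1.0, snr=1.0)
p_high_snr = outage_prob(R=1.0, snr=100.0)
# Outage drops sharply with SNR for the 2x2 channel.
```

The rate at which the outage probability falls with SNR is precisely the diversity order discussed next.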


It has been shown (Zheng and Tse, 2003) that at high SNR the outage probability is the same as the frame-error probability in terms of the SNR exponent. Therefore, to evaluate the optimality of practical coding techniques, one can compare, for a given rate, how far the performance of the technique is from that predicted through an outage analysis. Moreover, the frame-error rates and outage capacity comparisons in Tarokh et al. (1998) can also be formally justified through this argument.

9.3.2 Diversity Order

In section 9.3.1 the focus was on achievable transmission rate. A more practical performance criterion is probability of error. This is particularly important when we are coding over a small number of blocks (low delay), where the Shannon capacity is zero (Ozarow et al., 1994) and we are in the outage regime as was seen above. By characterizing the error probability, we can also formulate design criteria for space-time codes. Since we are allowed to transmit a coded sequence, we are interested in the probability that an erroneous codeword e is mistaken for the transmitted codeword x. This is called the pairwise error probability (PEP) and is used to bound the error probability. This analysis relies on the condition that the receiver has perfect channel state information. However, a similar analysis can be done when the receiver does not know the channel state information but has statistical knowledge of the channel (Hochwald and Marzetta, 2000). For simplicity, we shall again focus on a flat-fading channel (where ν = 0) and on the case when the channel matrix contains i.i.d. zero-mean Gaussian elements, i.e., H_{i,j} ∼ CN(0, 1). Many of these results can be easily generalized for ν > 0 as well as for correlated fading and other fading distributions. Consider a codeword sequence X = [x^t(0), ..., x^t(T − 1)]^t, where x(k) = [x_1(k), ..., x_{M_t}(k)]^t (defined in eq. 9.4). In the case when the receiver has perfect channel state information, we can bound the PEP between two codeword sequences x and e (denoted by P(x → e)) as follows (Guey et al., 1999; Tarokh et al., 1998):

P(x → e) ≤ Π_{n=1}^{M_t} 1 / (1 + (E_s/(4N_0)) λ_n)^{M_r} .    (9.18)

Here E_s = P/M_t is the power per transmitted symbol, λ_n are the eigenvalues of the matrix A(x, e) = B*(x, e)B(x, e), and

B(x, e) = [ x_1(0) − e_1(0)          ...   x_{M_t}(0) − e_{M_t}(0)
            ...                      ...   ...
            x_1(T − 1) − e_1(T − 1)  ...   x_{M_t}(T − 1) − e_{M_t}(T − 1) ] .    (9.19)


If q denotes the rank of A(x, e) (i.e., the number of nonzero eigenvalues), then we can bound equation 9.18 as

P(x → e) ≤ ( Π_{n=1}^{q} λ_n )^{−M_r} (E_s/(4N_0))^{−qM_r} .    (9.20)

We define the notion of diversity order as follows.

Definition 9.2 A coding scheme which has an average error probability P̄_e(SNR) that behaves as

lim_{SNR→∞} log(P̄_e(SNR)) / log(SNR) = −d    (9.21)

as a function of SNR is said to have a diversity order of d. In words, a scheme with diversity order d has an error probability at high SNR behaving as P̄_e(SNR) ≈ SNR^{−d} (see fig. 9.6). One reason to focus on such a behavior for the error probability can be seen from the following intuitive argument for a simple scalar fading channel (M_t = 1 = M_r). It is well known that for a particular frame b, the error probability for binary transmission, conditioned on the channel realization h^(b), is given by P_e(h^(b)) = Q(√(2 SNR) |h^(b)|) (Proakis, 1995). Hence if |h^(b)|√(2 SNR) ≫ 1, then P_e(h^(b)) ≈ 0, and if |h^(b)|√(2 SNR) ≪ 1, then P_e(h^(b)) ≈ 1/2. Therefore a frame is in error with high probability when the channel gain |h^(b)|² ≲ 1/SNR, i.e., when the channel is in a "deep fade." Therefore the average error probability is well approximated by the probability that |h^(b)|² ≲ 1/SNR. For high SNR we can show that, for h ∼ CN(0, 1), P[|h|² < 1/SNR] ≈ 1/SNR, and this explains the behavior of the average error probability. Although this is a crude analysis, it brings out the most important difference between the additive white Gaussian noise (AWGN) channel and the fading channel. The typical way in which an error occurs in a fading channel is due to channel failure, i.e., when the channel gain is very small (|h|² ≲ 1/SNR). On the other hand, in an AWGN channel errors occur when the noise is large, and since the noise is Gaussian it has an exponential tail, making this very unlikely at high SNR. Given definition 9.2 of diversity order, we see that the diversity order in equation 9.20 is at most qM_r. Moreover, in inequality 9.20 we notice that we also obtain a coding gain of (Π_{n=1}^{q} λ_n)^{1/q}. Note that in order to obtain the average error probability, one can calculate a naive union bound using the pairwise error probability given in equation 9.20, but this may not be tight.
A more careful upper bound for the error probability can be derived (Zheng and Tse, 2003). However, if we ensure that every pair of codewords satisfies the diversity order in equation 9.20, then clearly the average error probability satisfies it as well. This is true when the transmission rate is held constant with respect to SNR, i.e., for a fixed-rate code. Therefore, in the case of fixed-rate code design, the simple pairwise error probability given in equation 9.20 is sufficient to obtain the correct diversity order.
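The deep-fade argument above can be checked numerically. The sketch below is our own illustration (the function names are not from the chapter): it uses the closed-form BPSK error probabilities over AWGN and Rayleigh fading and estimates the diversity order as the high-SNR slope of definition 9.2.

```python
import math

def awgn_ber(snr):
    # BPSK over AWGN: P_e = Q(sqrt(2*SNR)) = 0.5 * erfc(sqrt(SNR))
    return 0.5 * math.erfc(math.sqrt(snr))

def rayleigh_ber(snr):
    # BPSK over Rayleigh fading, h ~ CN(0,1): the closed-form average of
    # Q(sqrt(2*SNR)*|h|), which decays only as ~1/(4*SNR)
    g = math.sqrt(snr / (1.0 + snr))
    return 0.5 * (1.0 - g)

def diversity_order(ber, snr_db_lo, snr_db_hi):
    # empirical slope -log(P_e)/log(SNR) between two high-SNR points,
    # mimicking the limit in definition 9.2
    s1 = 10.0 ** (snr_db_lo / 10.0)
    s2 = 10.0 ** (snr_db_hi / 10.0)
    return -(math.log(ber(s2)) - math.log(ber(s1))) / (math.log(s2) - math.log(s1))
```

The fading channel exhibits slope d ≈ 1, while the exponential tail of the Gaussian noise makes the AWGN slope far steeper at comparable SNRs.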

Diversity in Communication: From Source Coding to Wireless Networks

Figure 9.6 Relationship between error probability and diversity order (error probability versus SNR in dB; curves shown for diversity orders d = 1 and d = 2).

In order to design practical codes that achieve a performance target, we need to glean insights from the analysis to state design criteria. For example, in the flat-fading case of equation 9.20 we can state the following rank and determinant design criteria.

Design criteria for space-time codes over flat-fading channels (Tarokh et al., 1998):

Rank criterion: In order to achieve maximum diversity Mt Mr, the matrix B(x, e) from equation 9.19 has to be full rank for any distinct codewords x, e. If the minimum rank of B(x, e) over all pairs of distinct codewords is q, then a diversity order of qMr is achieved.

Determinant criterion: For a given diversity order target of q, maximize ( ∏_{n=1}^{q} λ_n )^{1/q} over all pairs of distinct codewords.

Over the past few years, there have been significant developments in designing codes which can guarantee a given reliability (error probability). An exhaustive listing of all these developments is beyond the scope of this chapter, but we give a glimpse of the recent work; the interested reader is referred to Diggavi et al. (2004b) and references therein. Pioneering work on trellis codes for Gaussian channels was done in Ungerboeck (1982). In Tarokh et al. (1998), the first space-time trellis code constructions were presented; in this seminal work, trellis codes were carefully designed to meet the design criteria for minimizing error probability. In parallel, a very simple coding idea for Mt = 2 was developed in Alamouti (1998). This code achieved the maximal diversity order of 2Mr and had a very simple decoder associated with it. The elegance and simplicity of the Alamouti code have made it a candidate for the next generation of wireless systems, which are slated to utilize space-time codes. The basic idea of the Alamouti code was extended to orthogonal designs in Tarokh et al. (1999). The publication of Tarokh et al. (1998) and Alamouti (1998) created a significant community of researchers working on space-time code constructions.
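As a concrete illustration of the rank and determinant criteria, the following sketch (our own, with hypothetical helper names) checks the 2×2 Alamouti code mentioned above over BPSK: every codeword-difference matrix turns out to be full rank, consistent with the code achieving the maximal diversity order 2Mr.

```python
from itertools import product

def alamouti(s1, s2):
    # 2x2 Alamouti block: rows are time slots, columns are transmit antennas
    return ((s1, -s2.conjugate()), (s2, s1.conjugate()))

def det2(m):
    # determinant of a 2x2 matrix
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def diff(x, e):
    # codeword-difference matrix
    return tuple(tuple(x[i][j] - e[i][j] for j in range(2)) for i in range(2))

bpsk = (-1.0 + 0j, 1.0 + 0j)
codewords = [alamouti(a, b) for a, b in product(bpsk, repeat=2)]
# |det| of the difference matrix over all distinct codeword pairs;
# a nonzero minimum means every pair is full rank (rank criterion met)
dets = [abs(det2(diff(x, e))) for x in codewords for e in codewords if x != e]
min_det = min(dets)
```

For the Alamouti structure the difference of two codewords is itself an Alamouti matrix, so its determinant is |d1|² + |d2|² > 0 for any distinct pair, which is why the minimum determinant here is strictly positive.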
Over the past few years, there has been significant progress in the construction of space-time codes for coherent channels. The design of codes that are linear in the complex field was proposed in Hassibi and Hochwald (2002), and efficient decoders for such codes were given in Damen et al. (2000). Codes based on algebraic rotations and number-theoretic tools are developed in El-Gamal and Damen (2003) and Sethuraman et al. (2003). A common assumption in all these designs was that the receiver had perfect knowledge of the channel. Techniques based on channel estimation, and the evaluation of the resulting degradation in performance for space-time trellis codes, were examined in Naguib et al. (1998). In another line of work, non-coherent space-time codes were proposed in Hochwald and Marzetta (2000). This also led to the design and analysis of differential space-time codes for flat-fading channels (Hochwald and Sweldens, 2000; Hughes, 2000; Tarokh and Jafarkhani, 2000). This was also examined for frequency-selective channels in Diggavi et al. (2002a). As can be seen, the topic of space-time codes is still evolving, and we have given just a snapshot of the recent developments.

9.3.3 Rate-Diversity Tradeoff

A natural question that arises is how many codewords we can have while attaining a certain diversity order. For a flat Rayleigh fading channel, this has been examined (Lu and Kumar, 2003; Tarokh et al., 1998) and the following result was obtained.

Theorem 9.6 If we use a transmit signal with a constellation of size |S| and the diversity order of the system is qMr, then the rate R that can be achieved is bounded as

R ≤ (Mt − q + 1) log2 |S|   (9.22)

in bits per transmission.

One consequence of this result is that for maximum (Mt Mr) diversity order we can transmit at most log2 |S| bits/sec/Hz. Note that the trade-off in theorem 9.6 is established with a constraint on the alphabet size of the transmit signal, which may not be fundamental from an information-theoretic point of view. An alternate viewpoint of the rate-diversity trade-off has been explored in Zheng and Tse (2003) from a Shannon-theoretic point of view. In that work the authors are interested in the multiplexing rate of a transmission scheme.

Definition 9.3 A coding scheme with a transmission rate of R(SNR) as a function of SNR is said to have a multiplexing rate r if

lim_{SNR→∞} R(SNR) / log(SNR) = r .   (9.23)

Therefore, the system has a rate of r log(SNR) at high SNR. One way to contrast this with the statement in theorem 9.6 is to note that the constellation size is also allowed to become larger with SNR. The naive union bound of the pairwise

Figure 9.7 Rate-diversity trade-off curve: diversity order versus multiplexing rate, piecewise linear through the points (0, Mt Mr), (1, (Mt − 1)(Mr − 1)), . . . , (k, (Mt − k)(Mr − k)), . . . , (min(Mt, Mr), 0).

error probability (eq. 9.18) has to be used with care if the constellation size is also increasing with SNR. There is a trade-off between the achievable diversity and the multiplexing gain, and d∗(r) is defined as the supremum of the diversity gain achievable by any scheme with multiplexing gain r. The main result in Zheng and Tse (2003) states the following.

Theorem 9.7 For T > Mt + Mr − 1 and K = min(Mt, Mr), the optimal trade-off curve d∗(r) is given by the piecewise linear function connecting the points (k, d∗(k)), k = 0, . . . , K, where

d∗(k) = (Mr − k)(Mt − k) .   (9.24)

If r = k is an integer, the result can be notionally interpreted as using Mr − k receive antennas and Mt − k transmit antennas to provide diversity while using k antennas to provide the multiplexing gain. However, this interpretation is not physical, but really an intuitive explanation of the result in theorem 9.7. Clearly this result means that one can get large rates, growing with SNR, by reducing the diversity order below the maximum achievable. This diversity-multiplexing trade-off implies that a high multiplexing gain comes at the price of decreased diversity gain, and is a manifestation of a corresponding trade-off between error probability and rate. This trade-off is depicted in fig. 9.7. Therefore, as illustrated in theorems 9.6 and 9.7, the trade-off between diversity and rate is an important consideration both in terms of coding techniques (theorem 9.6) and in terms of Shannon theory (theorem 9.7). A different question was proposed in Diggavi et al. (2003, 2004a), where it was asked whether there exists a strategy that combines high-rate communications


with high reliability (diversity). Clearly the overall code will still be governed by the rate-diversity trade-off, but the idea is to ensure the reliability (diversity) of at least part of the total information. This allows a form of communication where the high-rate code opportunistically takes advantage of good channel realizations, whereas the embedded high-diversity code ensures that at least part of the information is received reliably. In this case, the interest is not in a single pair of multiplexing rate and diversity order (r, d), but in a tuple (ra, da, rb, db), where rate ra and diversity order da are ensured for part of the information, with the rate-diversity pair (rb, db) guaranteed for the other part. A class of space-time codes with such desired characteristics has been constructed in Diggavi et al. (2003, 2004a). From an information-theoretic point of view, Diggavi and Tse (2004) focused on the case when there is one degree of freedom (i.e., min(Mt, Mr) = 1). In that case, if we take da ≥ db without loss of generality, the following result was established (Diggavi and Tse, 2004):

Theorem 9.8 When min(Mt, Mr) = 1, the diversity-multiplexing trade-off curve is successively refinable, i.e., for any multiplexing gains ra and rb such that ra + rb ≤ 1, the diversity orders

da = d∗(ra),   db = d∗(ra + rb),   (9.25)

with da ≥ db, are achievable, where d∗(r) is the optimal diversity order given in theorem 9.7.

Since the overall code still has to be governed by the rate-diversity trade-off given in theorem 9.7, it is clear that the trivial outer bound to the problem is da ≤ d∗(ra) and db ≤ d∗(ra + rb). Hence theorem 9.8 shows that the best possible performance can be achieved. This means that for min(Mt, Mr) = 1, we can design ideal opportunistic codes. This new direction of enquiry is currently being explored.
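The trade-off curves of theorems 9.7 and 9.8 are simple to evaluate. The sketch below is our own illustration (the name `d_star` is ours): it computes the piecewise linear curve d∗(r) and can be used to check the successive-refinement values for a system with min(Mt, Mr) = 1.

```python
def d_star(r, Mt, Mr):
    # Piecewise-linear diversity-multiplexing trade-off of theorem 9.7,
    # connecting the corner points (k, (Mt - k)(Mr - k)), k = 0..min(Mt, Mr).
    K = min(Mt, Mr)
    if not 0 <= r <= K:
        raise ValueError("multiplexing rate must lie in [0, min(Mt, Mr)]")
    k = min(int(r), K - 1)              # lower corner of the active segment
    d_lo = (Mt - k) * (Mr - k)
    d_hi = (Mt - k - 1) * (Mr - k - 1)
    return d_lo + (r - k) * (d_hi - d_lo)
```

For min(Mt, Mr) = 1 the curve is the single segment d∗(r) = Mt Mr (1 − r), so the pairs (da, db) = (d∗(ra), d∗(ra + rb)) of theorem 9.8 are read directly off this line.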

9.4 Multi-user Diversity

In section 9.3, we explored the importance of using many fading realizations, through multiple antennas, for reliable, high-rate, single-user wireless communication. In this section we explore another form of diversity, which arises because each user potentially has independent channel conditions and a local interference environment; the different users themselves can thus be viewed as a source of diversity. This implies that in fig. 9.5 the fading links between users are random and independent of each other. Therefore, this diversity in channel and interference conditions can be exploited by treating the independent links from different users as conduits for information transfer. In order to explore this idea further, we first digress to discuss communication topologies. As seen in section 9.2 (see fig. 9.5), we can view the n-user communication network through the underlying graph GC. One topology which is very

commonly seen in practice is obtained by giving special status to one of the nodes as the base station or access point. The other nodes can communicate only with the base station. We call such a topology the hierarchical communication topology (see fig. 9.5). An alternate topology, which has emerged more recently, arises when the nodes organize themselves without a centralized base station. Such a topology is called an ad hoc communication topology; here the nodes relay information from source to destination, typically through multiple "nearest neighbor" communication hops (see also fig. 9.8). In both these topologies there is potential to utilize multi-user diversity, but the methods for doing so are distinct. Therefore we explore them separately in sections 9.4.1 and 9.4.2.

9.4.1 Opportunistic Scheduling

In the hierarchical topology, we distinguish between two types of problems: the first is the uplink channel, where the nodes communicate to the access point (many-to-one communication, or the multiple access channel), and the second is the downlink channel, where the access point communicates to the nodes (one-to-many communication, or the broadcast channel). The idea of multi-user diversity can be further motivated by looking at the scalar fading multiple access channel. If the users are distributed across geographical areas, their channel responses will differ depending on their local environments. This is modeled by choosing the users' channels to vary according to distributions that are independent and identical across users. The rate region of the uplink channel for this case was characterized in Knopp and Humblet (1995), where it was shown that in order to maximize the total information capacity (the sum rate), it is optimal to transmit only to the user with the best channel. For the scalar channel, the channel gain determines the best channel. The result in Knopp and Humblet (1995), when translated to rapidly fading channels, results in a form of time-division multiple access (TDMA) in which the users are not preassigned time slots, but are scheduled according to their respective channel conditions. Even if a particular user is in a deep fade at the current time, there could be another user who has good channel conditions. Hence this strategy is a form of multi-user diversity, where the diversity is viewed across users. Here the multi-user diversity (which arises through independent channel realizations across users) can be harnessed using an appropriate scheduling strategy. If the channels vary rapidly in time, the idea is to schedule users when their channel state is close to the peak rate that it can support. A similar result also holds for the scalar fading broadcast channel (Li and Goldsmith, 2001; Tse, 1997).
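The gain from scheduling the best of K users can be illustrated with a small calculation of our own (not from the chapter): for i.i.d. Rayleigh fading, |h_k|² is exponentially distributed, so the expected best channel gain is the harmonic number H_K ≈ log K, growing with the number of users.

```python
import random

def best_user_gain(K, trials=20000, seed=7):
    # Average of max_k |h_k|^2 over K i.i.d. Rayleigh-fading users.
    # |h_k|^2 is Exp(1), so the expectation equals the harmonic number
    # H_K = 1 + 1/2 + ... + 1/K -- the multi-user diversity gain.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(K))
    return total / trials
```

With 8 users the scheduled channel is, on average, about 2.7 times stronger than a single user's channel, which is exactly the effect the best-user scheduling result of Knopp and Humblet exploits.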
Note that this requires feedback from the users to the base station about the channel conditions. The feedback could be just the received SNR. These results are proved on the basis of two assumptions: one is that all the users have identically distributed (i.e., symmetric) channels, and the other is that we are interested in long-term rates. We focus on the first assumption, and later briefly return to the question of delay. In wireless networks, the users' channels are almost never symmetric. Nodes that


are closer to the base station experience much better channels on average than nodes that are further away (due to path loss; see section 9.2). Therefore, using a TDMA technique that allows exclusive use of the channel by the best user would be inherently unfair to users who are further away. Suppose the long-term average rates {Tk} are to be provided to the users. The criterion used in the result of Knopp and Humblet (1995) was the sum throughput of all the users, i.e., max Σ_k Tk. This criterion can be maximized by scheduling only the nodes with strong channels, and this could be an unfair allocation of resources across users. In order to translate the intuition about multi-user diversity into practice, one would need to ensure fairness among users. The idea in Bender et al. (2000), Jalali et al. (2000), and Chaponniere et al. is to use a proportionally fair criterion for scheduling, which maximizes Σ_{k=1}^{K} log(Tk). This idea is inherently used in the downlink scheduling algorithm of IS-856 (Bender et al., 2000; Chaponniere et al.; Jalali et al., 2000), also known as the high data rate (HDR) 1xEV-DO system. The scheduling algorithm implemented in the 1xEV-DO system keeps track of the average throughput Tk(t) of user k over a past window of length tc. Let the rate that can be supported to user k at time t be denoted by Rk(t). At time t, the scheduling algorithm transmits to the user with the largest ratio Rk(t)/Tk(t) among the active users. The average throughputs are then updated given the current allocation. Since this idea ensures fairness while utilizing multi-user diversity, it is an instantiation of an opportunistic scheduler. The scheduling algorithm described above relies on the rates supported by the users varying rapidly in time. But this assumption can be violated when the channels are constant or only very slowly time-varying.
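A minimal sketch of such a proportional fair scheduler follows (our illustration under simplified assumptions, not the actual IS-856 implementation; an exponentially weighted window of length tc stands in for the averaging used in practice).

```python
import random

def proportional_fair(rate_draws, tc=100.0, horizon=2000, seed=1):
    # rate_draws[k] draws the rate R_k(t) user k can support in slot t;
    # each slot serves the user maximizing R_k(t) / T_k(t), then updates
    # the windowed average throughputs T_k.
    rng = random.Random(seed)
    K = len(rate_draws)
    T = [1e-6] * K                 # average-throughput trackers
    served = [0.0] * K             # total rate actually delivered per user
    for _ in range(horizon):
        r = [rate_draws[k](rng) for k in range(K)]
        k_star = max(range(K), key=lambda k: r[k] / T[k])
        for k in range(K):
            inst = r[k] if k == k_star else 0.0
            T[k] = (1.0 - 1.0 / tc) * T[k] + inst / tc
            served[k] += inst
    return [s / horizon for s in served]
```

Run with one strong and one weak user, the scheduler serves both (unlike a max-rate rule, which would starve the weak user), while still giving the strong user more throughput by riding its channel peaks.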
In order to artificially induce time variations, Viswanath et al. (2002) propose to use multiple transmit antennas and to introduce random phase rotations between the antennas to simulate fast fading. This idea of phase-sweeping across multiple antennas was also proposed in Weerackody (1993) and Hiroike et al. (1992) in the context of creating time diversity in single-user systems. With such artificially induced fast channel variations, the same scheduling algorithm used in IS-856 (outlined above) inherently captures the multi-user spatial diversity of the network. In Viswanath et al. (2002), this technique is shown to achieve the maximal diversity order (see section 9.3.2) for each user, asymptotically in the number of (uniformly distributed) users. In a heavily loaded system (a large number of users) with a uniform distribution of users, the technique proposed in Viswanath et al. (2002) is attractive. However, for lightly loaded systems, or when delay is an important QoS criterion, its desirability is less clear. Given that the technique proposed in Viswanath et al. (2002) is based on a rate-based QoS criterion, it cannot provide delay guarantees for the jobs of different users. This motivates the discussion of scheduling algorithms for job-based QoS criteria. In job-based criteria, the requests are assumed to come in at certain arrival times ai, and we have information about the sizes si (say, in bytes). Response time is defined as ci − ai, where ci is the time when a request was fully serviced and ai is the arrival time of the request. This is a standard QoS criterion for a request.

Relative response is defined as (ci − ai)/si (Bender et al., 1998). Relative response was proposed in the context of heterogeneous workloads, such as the Web, i.e., requests for data of different sizes (and thus different si). The above criteria relate to guarantees per request; we could also give guarantees only over all requests. For example, the overall performance criterion for a set of jobs could be the l∞ norm, namely, maxi (ci − ai) (i.e., max response time) or maxi (ci − ai)/si (i.e., max relative response). Other criteria, based on averages instead of maxima, are also studied. The new generation of wireless networks can support multiple transmission rates depending on the channel conditions. Assuming an accurate communication-theoretic model for the physical-layer achievable rates (as described in section 9.3), job-scheduling algorithms are proposed and analyzed for various QoS criteria in Becchetti et al. (2002). These algorithms utilize the diverse job requirements of the users to provide provable guarantees in terms of the job-scheduling criteria. These discussions just illustrate how multi-user diversity can be utilized in hierarchical networks. This form of opportunistic scheduling is an important part of the new generation of wireless data networks.
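The per-request criteria above can be sketched as follows (an illustrative helper of our own naming, assuming jobs are given as (arrival, size, completion) triples).

```python
def response_metrics(jobs):
    # jobs: list of (arrival a_i, size s_i, completion c_i) tuples.
    # Returns the two l_inf criteria from the text:
    # max response time and max relative response.
    max_resp = max(c - a for a, s, c in jobs)
    max_rel = max((c - a) / s for a, s, c in jobs)
    return max_resp, max_rel
```

Note how the two criteria can disagree: a large job with a long response time may have a small relative response, while a tiny job delayed even briefly dominates the relative-response metric.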

9.4.2 Mobile Ad Hoc Networks

In an ad hoc communication topology (network), one need not transmit information directly from source to destination, but can instead use other users as relays to help carry information to its ultimate destination. Such multihop wireless networks have a rich history (see, for example, Hou and Li, 1986, and references therein). In an important step toward systematically understanding the capacity of wireless networks, Gupta and Kumar (2000) explored the behavior of wireless networks asymptotically in the number of users. In their setup, n nodes were placed independently and randomly at locations {Si} in a finite geographical area (a scaled unit disk), and m = Θ(n) source and destination (S-D) pairs {(Si, Ti)} were randomly chosen, as shown in fig. 9.8. The model assumes that each source Si has an infinite stream of (information) packets to send to its respective destination Ti. The nodes are allowed to use any scheduling and relaying strategy through other nodes to send the packets from the sources to the destinations (see fig. 9.8). The goal is to analyze the best possible long-term throughput per S-D pair asymptotically in the number of nodes n.

Figure 9.8 Routes from sources {Si} (denoted by filled circles) to destinations {Ti} (denoted by shaded circles).

In Gupta and Kumar (2000), a single-user communication model was used in which each node transmitted information to its intended receiver (relay or destination node), and the receiver treated the interference from other nodes as part of the noise. Therefore, in this communication model, a successful transmission of rate R occurred when the signal-to-interference-plus-noise ratio (SINR) was above a certain threshold β. Clearly, such a communication model can be improved by attempting to decode the "interference" from other nodes using sophisticated multi-user decoding (Verdu, 1998). But such a decoding strategy was not considered by Gupta and Kumar (2000), and therefore this need not be an information-theoretically optimal strategy. In order to represent wireless signal transmission, the signal strength variation was modeled only through path loss (see section 9.2) with exponent α. Therefore, if {Pi} are the powers at which the various nodes transmit, then the SINR from node i to node j is defined as

SINR = ( Pi / |Si − Sj|^α ) / ( σ² + Σ_{k∈I, k≠i} Pk / |Sk − Sj|^α ) ,   (9.26)

where I is the subset of users simultaneously transmitting at some time instant. Next, we need to define the notion of throughput per S-D pair more precisely.

Definition 9.4 For a scheduling and relay policy π, let Mi^π(t) be the number of packets from source node Si to its destination node Ti successfully delivered at time t. A long-term throughput λ̃(n) is feasible if there exists a policy π such that for every source-destination pair

liminf_{T→∞} (1/T) Σ_{t=1}^{T} Mi^π(t) ≥ λ̃(n) .   (9.27)

We define the throughput λ(n) as the highest achievable λ̃(n). Note that λ(n) is a random quantity which depends on the node locations of the users. Our interest is in the scaling law governing λ(n), i.e., the behavior of λ(n) asymptotically in n. One of the main results of Gupta and Kumar (2000) was the following.

Theorem 9.9 There exist constants c1 and c2 such that

lim_{n→∞} P{ λ(n) = c1 R / √(n log n) is feasible } = 1 ,

lim_{n→∞} P{ λ(n) = c2 R / √n is feasible } = 0 .


Therefore, the long-term per-user throughput decays as O(1/√n), showing that high per-user throughput may be difficult to attain in large-scale (fixed) wireless networks. This result has recently been strengthened: it was shown by Franceschetti et al. (2004) that λ(n) = Θ(1/√n). One way to interpret this result is the following. If n nodes are randomly placed in a unit disk, nearest neighbors are (with high probability) at a distance O(1/√n) apart. Gupta and Kumar (2000) show that it is important to schedule a large number of simultaneous short transmissions, i.e., between nearest neighbors. If randomly chosen source-destination pairs are O(1) distance apart and we can only schedule nearest-neighbor transmissions, information has to travel O(√n) hops to reach its destination. Since there can be at most O(n) simultaneous transmissions at a given time instant, this imposes a O(1/√n) upper bound on such a strategy. This is an intuitive argument; a rigorous proof of theorem 9.9 is given in Gupta and Kumar (2000), among other interesting results. Note that the coding strategy in theorem 9.9 was simple, and the interference was treated as part of the noise. An open question concerns the achievable throughput when sophisticated multi-user coding and decoding are used. For such an information-theoretic characterization, understanding the rate region of the relay channel is an important component (Cover and Thomas, 1991). The relay channel was introduced in van der Meulen (1977), and the rate region for special cases was presented in Cover and El Gamal (1979). Recently, Leveque and Telatar (2005), Xie and Kumar (2004), and Gupta and Kumar (2003) have established that even with network information-theoretic coding strategies, the per S-D pair throughput scaling law decays with the number of users n. A natural question that arises is whether there is any mechanism by which one can improve the scaling law for throughput in wireless networks.
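Before turning to that question, the O(1/√n) nearest-neighbor geometry underlying the intuitive argument above can be checked with a small Monte Carlo experiment of our own (a unit square stands in for the unit disk): quadrupling the number of nodes should roughly halve the mean nearest-neighbor distance.

```python
import math
import random

def mean_nn_distance(n, seed=0):
    # Mean nearest-neighbor distance among n uniform points in the unit
    # square; this scales as Theta(1/sqrt(n)).
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    total = 0.0
    for i, (xi, yi) in enumerate(pts):
        d2 = min((xi - xj) ** 2 + (yi - yj) ** 2
                 for j, (xj, yj) in enumerate(pts) if j != i)
        total += math.sqrt(d2)
    return total / n
```

The ratio mean_nn_distance(800) / mean_nn_distance(200) comes out close to √(200/800) = 0.5, as the Θ(1/√n) scaling predicts.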
Mobility was one such mechanism, examined in Grossglauser and Tse (2002). In the model studied, random node mobility was allowed, and the locations {Si(t)} vary in a uniform, stationary, and ergodic manner over the entire disk (see fig. 9.9). In the presence of such symmetric (among users) and "space-filling" mobility patterns, the following surprising result was established in Grossglauser and Tse (2002).

Theorem 9.10 There exist a scheduling and relaying policy π and a constant c > 0 such that

lim_{n→∞} P{ λ(n) = cR is feasible } = 1 .   (9.28)

Therefore, node mobility allows us to achieve a per-user throughput of Θ(1). The main reason this is attainable is that packets are relayed through only a finite number of hops by utilizing node mobility. Thus, a node carries packets over O(1) distance before relaying them, and Grossglauser and Tse (2002) show that, with high probability, if the mobility patterns are space-filling, the number of hops needed from source to destination is bounded, instead of growing as O(√n) as in

Figure 9.9 Mobility in ad hoc networks. The figure on the left shows a space-filling mobility model where the nodes uniformly cover the region. The figure on the right shows a limited one-dimensional mobility model where nodes move along fixed line segments.

the case of fixed (nonmobile) wireless networks (Gupta and Kumar, 2000). However, the above mobility model is a generous one, since (1) it is homogeneous, i.e., every node has the same mobility process, and (2) the sample path of each node "fills the space over time," meaning that there is a nonzero probability that the node visits every part of the geographical region. A natural question is whether the throughput result in Grossglauser and Tse (2002) depends strongly on these two features of the mobility model. In Diggavi et al. (2002b), a different mobility model is introduced which embodies two salient features that many real mobility processes seem to possess (e.g., cars traveling on roads, people walking in buildings or cities, trains, satellites circling the earth) and which are not captured by the model in Grossglauser and Tse (2002). First, an individual node typically visits only a small portion of the entire space, and rarely leaves this preferred region. Second, the nodes do move frequently within their preferred regions, and an individual region often covers a large distance. As an extreme abstraction of such mobility processes, Diggavi et al. (2002b) studied mobility patterns where nodes move along a given set of one-dimensional paths (see fig. 9.9). In particular, the mobility patterns were restricted to random line segments, and once chosen, the configuration of line segments is fixed for all time. Therefore, given the configuration, the only randomness arises through user mobility along these line segments. In order to isolate the effects of one-dimensional mobility from edge effects, Diggavi et al. (2002b) studied a model in which the nodes are on a unit sphere but each node is constrained to move on a one-dimensional great circle. Therefore, a configuration in this case is a set of one-dimensional paths (great circles) which are fixed throughout the communication period, and the nodes move randomly only along these paths. Thus, the homogeneity assumption


in Grossglauser and Tse (2002) is now relaxed. In particular, there can be pairs of nodes that are far more likely to be in close proximity to each other than other pairs. For example, if two one-dimensional paths nearly overlap, the probability of a close encounter between their nodes is significantly larger than for two paths that are "far apart." This lack of homogeneity implies, as shown in Diggavi et al. (2002b), that there are configurations in which constant throughput is unattainable even with mobility. Since the capacity of such a mobile ad hoc network then depends on the constellation of one-dimensional paths, the question becomes one of scaling laws for a random configuration. Therefore, the configurations themselves are chosen randomly, with each one-dimensional path (great circle) chosen independently and with an identical uniform distribution. Given such a random configuration, the question then becomes whether "bad" configurations (where the per S-D pair throughput is not Θ(1)) occur often. One of the key ideas in Diggavi et al. (2002b) was the identification, and proof of existence, of typical ("good") configurations, on which the average long-term throughput per node is Θ(1). Intuitively, the typical configurations defined in Diggavi et al. (2002b) are those where the fraction of one-dimensional paths intersecting any given area is uniformly close to its expected value; that is, the empirical probability counts are uniformly close to the underlying probability of a random one-dimensional path intersecting that area. Therefore, even for a particular deterministically chosen configuration which satisfies the typicality condition, the per S-D pair throughput is Θ(1). One of the main results in Diggavi et al. (2002b) is that if the one-dimensional paths are chosen (uniformly) randomly and independently, then for almost all constellations of such paths, the throughput per S-D pair is Θ(1).
Therefore, for random configurations, the probability of an atypical configuration is shown to go to zero asymptotically in the network size n. Thus, although each node is restricted to move in a one-dimensional space, the same asymptotic performance is achieved as when the nodes can move in the entire two-dimensional region.

Theorem 9.11 Given a configuration C, there exist a scheduling and relaying policy π and a constant c > 0 such that

lim_{n→∞} P{ λ(n) = cR is feasible | C } = 1   (9.29)

for almost all configurations C as n → ∞, i.e., the probability of the set of configurations for which the policy achieves a throughput of cR goes to 1 as n → ∞.

Next we give a flavor of the techniques used to prove theorem 9.11. First, we examine a relaying strategy in which, at each time, every node carries source packets, which originate at that node, and relay packets, which originated at other nodes and are to be forwarded to their final destinations. In phase I, each sender attempts to transmit a source packet to its nearest receiver, who

Figure 9.10 The relaying strategy for mobile nodes. In phase I, the source attempts to transfer packets to relays; during phase II, the relays attempt to transfer packets to the destination.

will serve as a relay for that packet. In phase II, each sender identifies its nearest receiver and attempts to transmit a relay packet destined for it, if the sender has one (see fig. 9.10). As in equation 9.26, a successful transmission of rate R occurs when the signal-to-interference-plus-noise ratio (SINR) is above a certain threshold β. Note that it can be shown that if a source node attempts to "wait" until it encounters its destination, the per S-D pair throughput cannot be Θ(1). Therefore, every source spreads its traffic over random intermediate nodes depending on the mobility. Moreover, each packet is forwarded successfully to only one relay, i.e., there is no duplication. Mobility allows source-destination pairs to relay information through several independent relay paths, since nodes have changing nearest neighbors due to mobility. This method of relaying information through independent attenuation links which vary over time is also a form of multi-user diversity. One can see this by observing that the transmission occurs over several realizations of the communication graph GC; the relaying strategy which utilizes mobility schedules transmissions over appropriate realizations of the graph. Conceptually, this use of independent relays to transmit information from source to destination is illustrated in fig. 9.11, where the strategy of theorems 9.10 and 9.11 is used. Intuitively, if the source is able to spread its traffic uniformly over its relays (see fig. 9.11), then we can expect to obtain Θ(1) throughput per S-D pair. In order for this to occur, we need to show two properties:

1. Every node spends the same order of time as the nearest neighbor of Θ(n) other nodes. This ensures that each source can spread its packets uniformly across Θ(n) other nodes, all acting as relays, and that these packets can in turn be merged back into their respective final destinations.


Diversity in Communication: From Source Coding to Wireless Networks

2. When communicating with the nearest neighbor receiver, the capture probability is not vanishingly small even in a large system, even though there are Θ(n) interfering nodes transmitting simultaneously.
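These two properties are asymptotic statements, but property 1 can be illustrated with a toy Monte Carlo experiment. The sketch below uses an idealized mobility model (node positions redrawn i.i.d. uniformly on the unit square at every time step), which is our simplification rather than the mobility model of the theorems; over enough steps, a tagged node ends up serving as the nearest neighbor of almost every other node.

```python
import random

# Toy illustration of property 1 under an idealized i.i.d.-placement
# mobility model (our simplification): count how many distinct nodes
# have had node 0 as their nearest neighbor after `steps` time steps.
random.seed(1)
n, steps = 50, 300
partners = set()

for _ in range(steps):
    pos = [(random.random(), random.random()) for _ in range(n)]
    for j in range(1, n):
        # nearest neighbor of node j under squared Euclidean distance
        nearest = min((k for k in range(n) if k != j),
                      key=lambda k: (pos[j][0] - pos[k][0]) ** 2
                                    + (pos[j][1] - pos[k][1]) ** 2)
        if nearest == 0:
            partners.add(j)

print(f"node 0 has been the nearest neighbor of {len(partners)} of {n - 1} nodes")
```

With these parameters the tagged node has typically served as nearest neighbor for nearly all of the other nodes, consistent with the claim that each node can spread packets across a constant fraction of the network.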

throughput-delay trade-oﬀ

However, with one-dimensional mobility, it is shown in Diggavi et al. (2002b) that there exist configurations where these properties cannot be satisfied. This is where the identification of typical configurations becomes important. For typical configurations, it is shown in Diggavi et al. (2002b), through a detailed technical argument, that these properties hold. Moreover, for randomly chosen configurations, it is shown that such typical configurations occur with probability going to 1 asymptotically in n. Using these components, the proof of theorem 9.11 is completed.

There is a dramatic gain in the per S-D pair throughput in theorems 9.10 and 9.11 over theorem 9.9, from O(1/√n) to Θ(1). A natural question to ask is whether there is a cost to this improvement. The results in theorems 9.10 and 9.11 utilized node mobility to deliver the information from source to destination. Therefore, the timescale over which this is effective depends on the velocity of the nodes, which determines the rate of change of the topology. Hence we can expect significantly larger packet delays for this scheme as compared to the fixed network. In some sense, the Gupta-Kumar result in theorem 9.9 has a smaller throughput, but also a smaller packet delay, since the delays depend on successful packet transmissions over the route and not on changes in the node topology. Hence a natural question is whether there exists a fundamental trade-off between delay and throughput in ad hoc networks. This question was recently studied in El Gamal et al. (2004), where the authors quantified this trade-off. In order to quantify the trade-off there needs to be a formal definition of delay. In El Gamal et al. (2004), delay D(n) is defined as the sum of the times spent in every relay node. This definition does not include the queueing delay at the nodes, just the delay incurred in successful transmission of the packet on each single hop of the route.
Given this definition of delay, El Gamal et al. (2004) established that for a fixed random network of n nodes, when λ(n) = O(1/√(n log n)), the delay-throughput trade-off is D(n) = Θ(n λ(n)). For a mobile ad hoc network with λ(n) = Θ(1), El Gamal et al. (2004) showed that D(n) = Θ(√n / v(n)), where v(n) is the velocity of the mobile nodes. Therefore, this quantifies the cost of higher throughput in mobile networks.

The theoretical developments in sections 9.4.1 and 9.4.2 indicate the strong interactions between the physical layer coding schemes and channel conditions, and the networking issues of resource allocation and application design. This is an important insight we can draw for the design of wireless networks. Therefore, several problems which are traditionally considered as networking issues, and are typically designed independently of the transmission techniques, need to be reexamined in the context of wireless networks. As illustrated, diversity needs to be taken into account while solving these problems. Such an integrated approach is a major lesson learned from the theoretical considerations, and we develop another aspect of this through the study of source coding using route diversity in section 9.5.
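The contrast between the fixed and mobile delay regimes can be made concrete by plugging in sample values of n (an illustrative sketch only: the Θ(·) statements hide constants, which are set to 1 here, and the velocity v(n) is taken as a constant).

```python
import math

# Illustrative evaluation of the two delay regimes (constants in the
# Theta(.) expressions set to 1, velocity v(n) = 1; only scaling matters).
for n in (10**2, 10**4, 10**6):
    lam_fixed = 1.0 / math.sqrt(n * math.log(n))  # fixed-network throughput
    d_fixed = n * lam_fixed                       # D(n) = Theta(n * lambda(n))
    d_mobile = math.sqrt(n) / 1.0                 # D(n) = Theta(sqrt(n)/v) at Theta(1) throughput
    print(f"n = {n:>7}: fixed-network D(n) ~ {d_fixed:8.1f}, mobile D(n) ~ {d_mobile:8.1f}")
```

The mobile scheme's delay grows much faster with n, which is the price paid for its constant per S-D pair throughput.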


Figure 9.11  Multi-user diversity through relays. (The figure shows the source's traffic reaching the destination through multiple relay routes, in addition to the direct path, with transmissions in phase I and phase II.)

9.5  Route Diversity

The interest in section 9.4.2 was the characterization of long-term throughput from source to destination. However, in applications such as sensor networks (see, for example, Pottie and Kaiser, 2000; Pradhan et al., 2002, and references therein), there can be node failures which lead to routes being disconnected during a transmission period. This can become particularly crucial when there are strong delay constraints, such as those in real-time data delivery. Such route failures can also occur in ad hoc networks (discussed in section 9.4.2) as well as in wired networks. In multihop relay strategies, we could utilize the existence of multiple routes from source to destination in order to increase the probability of successfully receiving the information at the destination within delay constraints, despite route (path) failures. This is a form of route diversity (see fig. 9.12) and was first suggested by Maxemchuk (1975) in the context of wired networks. Note that, in a broad sense, the multiuser diversity studied in mobile ad hoc networks in section 9.4.2 also utilizes the presence of multiple routes from source to destination. However, in that case the multiple routes were utilized to increase the long-term per S-D pair throughput. In this section we will utilize the multiple routes for low-delay applications.

We will examine this problem in the context of delivering a real-time source (like speech, images, or video) with tight delay constraints. If the same information about the source is transmitted over both routes, then this is a form of repetition coding. However, when both routes are successful, there is no performance advantage. A more sophisticated technique would be to send correlated descriptions of the source over the two routes such that each description is individually good, but they differ from one another, so that if both routes are successful one gets a better approximation of the source.
This is the basic idea behind multiple description (MD) source coding (El Gamal and Cover, 1982). This notion can be

Figure 9.12  Route diversity. (The figure shows a source sequence fed to a source encoder, whose two descriptions travel over route 1 and route 2 to the destination.)

extended to more than two descriptions as well, but in this section we will focus on the two-description case for simplicity. The idea is that the source is coded through several descriptions, where we require that performance (distortion) guarantees can be given for any subset of the descriptions, and that the descriptions mutually refine each other. This is the topic discussed in sections 9.5.1 and 9.5.2.

In a packet-based network such as the Internet, packet losses are inevitable due to congestion or transmission errors. If the data does not have stringent delay constraints, error recovery methods typically ensure reliability either through a repeat request protocol or through forward error correction (Keshav, 1997). Another technique is scalable (or layered) coding, which sends a lower-rate base layer, or coarser description of the source, and refinement layers to enhance the description. Such a technique is again dependent on reliable delivery of the base layer; if the base layer is lost, the enhancement layers are of no use to the receiver. Therefore, such layered techniques are again inherently susceptible to route failures. These arguments reemphasize the need to develop multiple description (MD) source coding schemes. Note that layered coding schemes form a special case of such an MD coding scheme, where guarantees of performance are not given for individual layers, but the layers refine the coarser description of the source.

An important application for future wireless networks could be real-time video. There has been significant research into robust video coding in the presence of packet errors (Reibman and Sun, 2000). The main problem that arises in video is that the compression schemes typically use motion compensation, which introduces memory into the coded stream. Therefore, decoding the current video frame requires the availability of previous video frames.
If previous frames are corrupted or lost, the decoder needs methods to conceal such errors. This is an active research topic, especially in the context of wireless channels (Girod and Farber, 2000). However, an appealing approach to this problem might be through route diversity and MD coding, and this is briefly discussed in section 9.5.1.


9.5.1  Multiple Description (MD) Source Coding

rate-distortion function

In order to formalize the requirements on the MD source coder, we study the setup shown in fig. 9.13. As mentioned earlier, we will illustrate the ideas using only the two-description MD problem. Given a source sequence {X(k)}, we want to design an encoder that sends two descriptions, at rates R1 and R2, over the two routes such that we get guaranteed approximations of the source when either route fails, or when both succeed. In section 9.5.2 we develop techniques that achieve such an objective. In order to understand the fundamental bounds on the performance of such techniques, we need to examine the problem from an information-theoretic point of view. The main tool to do this is rate-distortion theory (Cover and Thomas, 1991). This theory describes the fundamental limits of the trade-off between the rate of the representation of a source and the quality of the approximation. Not surprisingly, the origins of this theory are in Shannon (1948, 1958b). In order to give some of the basic ideas, we first make a short digression on the rudiments of this theory.

Given a source sequence X^T = {X(1), . . . , X(T)} from a given alphabet 𝒳, the source encoder needs to describe it using R bits per source sample (i.e., with a total of RT bits for the sequence). Equivalently, we map the source to the index set J = {1, . . . , 2^{RT}}. The goal is that, given this description, a decoder is able to approximately reconstruct the source sequence by the sequence X̂^T = {X̂(1), . . . , X̂(T)}. This is accomplished by constructing a function f : J → X̂^T, where X̂ is the alphabet over which the reconstruction is done. Common examples for the alphabet are 𝒳 = ℝ = X̂, or the binary field. The distortion measure d̃(X^T, X̂^T) quantifies the quality of the approximation between the reconstructed and original source sequences. Typically, the distortion measure is a single-letter function constructed as

d̃(X^T, X̂^T) = (1/T) Σ_{i=1}^{T} d(X(i), X̂(i)),    (9.30)

where d(X, X̂) denotes the quality of the approximation for each sample. Common examples are d(X, X̂) = |X − X̂|² and the Hamming distance (Cover and Thomas, 1991). The simplest framework in which to give performance bounds is to analyze the performance of a source encoder for an independent and identically distributed random source sequence. Typically, the interest is in the average distortion over the set of input sequences, for the given probability distribution associated with the source sequence. Therefore, the average distortion is E[d̃(X^T, X̂^T)], and the problem becomes one of quantifying the smallest rate R that can be used to describe the source with average fidelity D, asymptotically in the block length T. This is called the rate-distortion function R(D), and it can be given an operational meaning by proving that there exist source codes that achieve this fundamental bound (Cover and


Thomas, 1991). The central result in single-source rate-distortion theory is that R(D) is characterized as

R(D) = min_{p(x̂|x) : E[d(X, X̂)] ≤ D} I(X; X̂),    (9.31)

where, as before, I(X; X̂) represents the mutual information between X and X̂ (Cover and Thomas, 1991). A simple instantiation of this result is the special case where we want D = 0, i.e., the lossless case. In this case, one can see that R(0) = H(X), where H(X) is the entropy of the source. Another important special case is when the source sequence comes from a Gaussian distribution, X ∼ N(0, σ_x²), and we are interested in the squared error distortion metric, i.e., d(X, X̂) = |X − X̂|². In this case, equation 9.31 evaluates to R(D) = (1/2) log(σ_x²/D) for D ≤ σ_x², and zero otherwise. Another way of writing this is in terms of the distortion-rate function D(R), which characterizes the smallest distortion achievable for a given rate. In the Gaussian case we see that D(R) = σ_x² 2^{−2R}. We will consider these two quantities interchangeably. The result in equation 9.31 guarantees only that the average distortion does not exceed D. However, under some regularity conditions, the rate-distortion function remains the same even when we require that the probability of the distortion d̃(X^T, X̂^T) exceeding D goes to zero (Berger, 1977; Cover and Thomas, 1991). The characterization of the rate-distortion function given in equation 9.31 has also been extended in many other ways, including to sources with memory (Cover and Thomas, 1991).

Armed with this background, we can now formulate the question of the fundamental rate-distortion bounds on multiple description (MD) source coding. The multiple description source encoder needs to produce two descriptions of the source, using R1 and R2 bits per source sample respectively. We can formally describe the problem by requiring that the reconstructions {X̂1(k)}, {X̂2(k)}, {X̂12(k)} use these descriptions to approximately reconstruct the source (see fig. 9.13). As in the "single description" case, we accomplish this by constructing functions

f1 : J1 → X̂^T,  f2 : J2 → X̂^T,  f12 : J1 × J2 → X̂^T,    (9.32)

where Ji = {1, . . . , 2^{R_i T}}, i = 1, 2, and X̂ is the alphabet over which the reconstruction is done. We want the approximations to give average fidelity guarantees of

E[d̃(X^T, X̂1^T)] ≤ D1,  E[d̃(X^T, X̂2^T)] ≤ D2,  E[d̃(X^T, X̂12^T)] ≤ D12.    (9.33)
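The single-description Gaussian formulas quoted above are easy to check numerically. The sketch below (ours; rates in bits) verifies that R(D) and D(R) are inverses and compares the bound against a naive, idealized uniform scalar quantizer, which sits above the distortion-rate bound as expected.

```python
import math
import random

# Sanity checks on the Gaussian rate-distortion pair (rates in bits):
# R(D) = 0.5 * log2(sigma_x^2 / D) and D(R) = sigma_x^2 * 2**(-2R).
sigma_x2 = 4.0

def rate(D):
    return 0.5 * math.log2(sigma_x2 / D) if D < sigma_x2 else 0.0

def dist(R):
    return sigma_x2 * 2.0 ** (-2.0 * R)

for R in (0.5, 1.0, 2.0):
    assert abs(rate(dist(R)) - R) < 1e-12  # the two functions are inverses

# An idealized (unbounded) uniform scalar quantizer whose cell width is
# chosen as if 2**R cells covered +/- 4 standard deviations: its MSE is
# roughly step**2 / 12, well above the distortion-rate bound D(R).
random.seed(0)
R = 3
step = 8.0 * math.sqrt(sigma_x2) / 2 ** R
samples = [random.gauss(0.0, math.sqrt(sigma_x2)) for _ in range(20000)]
mse = sum((x - (math.floor(x / step) + 0.5) * step) ** 2 for x in samples) / len(samples)
print(f"D(R={R}) bound: {dist(R):.4f}; scalar quantizer MSE: {mse:.4f}")
```

The gap between the empirical MSE and D(R) reflects the loss of a simple scalar scheme relative to the asymptotic (large block length) bound.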

The rate-distortion question in this context is to characterize the bounds on the tuple (R1, D1, R2, D2, D12); that is, we are interested in characterizing the achievable rate-distortion region described by this tuple. As can be seen, this appears to be a much more difficult question than the single-description problem, for which there is a complete characterization. As a matter of fact, the complete characterization of the MD rate region is still an open question. This problem was formalized in 1979, and in El Gamal and Cover (1982), a


Figure 9.13  Multiple description (MD) source coding. (The figure shows the source {X(k)} entering a multiple description source coder, which sends description 1 at rate R1 over route 1 and description 2 at rate R2 over route 2; decoder 1, the joint decoder, and decoder 2 produce the reconstructions {X̂1(k)}, {X̂12(k)}, and {X̂2(k)} respectively.)

theorem was proved which demonstrated a region of tuples (R1, D1, R2, D2, D12) for which MD source codes exist.

Theorem 9.12 (El Gamal and Cover, 1982)  Let X(1), X(2), . . . be a sequence of i.i.d. finite alphabet random variables drawn according to a probability mass function p(x). If the distortion measures are d_m(x, x̂_m), m = 1, 2, 12, then an achievable rate region for tuples (R1, R2, D1, D2, D12) is given by the convex hull of the following:

R1 ≥ I(X; X̂1),  R2 ≥ I(X; X̂2),  R1 + R2 ≥ I(X; X̂12, X̂1, X̂2) + I(X̂1; X̂2)    (9.34)

for some probability mass function p(x, x̂1, x̂2, x̂12) such that E[d_t(X, X̂_t)] ≤ D_t, t = 1, 2, 12.

This region was further improved in Zhang and Berger (1987) to a larger region for which MD source codes exist. However, what is unknown is whether these characterizations completely exhaust the set of achievable tuples, i.e., whether there is a converse for the MD rate-distortion region. There are further results for some special cases (Ahlswede, 1985; Fu and Yeung, 2002, and references therein). There has also been recent work on achievable rate regions for more than two descriptions (Pradhan et al., 2004; Venkataramani et al., 2003). However, in these cases as well, the complete characterization is unknown. The only case for which the MD region is completely characterized is that of memoryless Gaussian sources with squared error distortion measures, and specifically for two descriptions.12 In Ozarow (1980), it was shown that the two-description MD region given in El Gamal and Cover (1982) is also applicable to the Gaussian case with squared error distortion, where the alphabet is not finite. Moreover, it was shown that the region in theorem 9.12 is in fact the complete characterization, by proving a converse (outer bound) to the rate region. In this context the source was modeled as a sequence of i.i.d.
Gaussian random variables X ∼ N(0, σ_x²), and the squared error distortion measure was chosen, i.e., d_m(x, x̂_m) = |x − x̂_m|², m = 1, 2, 12. Therefore, specializing the result in theorem 9.12 to the Gaussian case yields the following complete characterization of the set


of all achievable tuples (R1, R2, D1, D2, D12) (El Gamal and Cover, 1982; Ozarow, 1980):

D1 ≥ σ_x² e^{−2R1},  D2 ≥ σ_x² e^{−2R2},
D12 ≥ σ_x² e^{−2(R1+R2)} / [1 − (√((1 − D1/σ_x²)(1 − D2/σ_x²)) − √(D1 D2/σ_x⁴ − e^{−2(R1+R2)}))²].    (9.35)
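The bound on D12 in equation 9.35 is easy to evaluate numerically. The sketch below (ours; rates in nats, unit-variance source) takes both descriptions to be individually optimal, D1 = D2 = e^{−2R}, and shows how far the resulting joint distortion bound sits above the single-description value e^{−2(R1+R2)}.

```python
import math

# Evaluate the joint-distortion bound of equation 9.35 (rates in nats).
def d12_bound(R1, R2, D1, D2, sigma2=1.0):
    excess = D1 * D2 / sigma2**2 - math.exp(-2.0 * (R1 + R2))
    pi = (1.0 - D1 / sigma2) * (1.0 - D2 / sigma2)
    denom = 1.0 - (math.sqrt(pi) - math.sqrt(max(excess, 0.0))) ** 2
    return sigma2 * math.exp(-2.0 * (R1 + R2)) / denom

R = 2.0
D = math.exp(-2.0 * R)          # each description individually optimal
joint = d12_bound(R, R, D, D)   # collapses to D / (2 - D) in this case
print(f"D = {D:.5f}, e^(-4R) = {D * D:.2e}, but D12 >= {joint:.5f}")
```

With individually optimal descriptions the excess term vanishes and the bound is orders of magnitude above e^{−4R}, making the tension discussed next visible numerically.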

In order to interpret this result, consider the following. As seen before, for the single-description Gaussian problem, the minimum distortion for a given rate is D(R) = σ_x² 2^{−2R}. Therefore, the distortions D1, D2 clearly need to be governed by the single-description bound, and this explains the first two inequalities in equation 9.35. However, in the MD problem we also need to bound the distortion D12 when both descriptions are available. From the single-description bound it is clear that we would have D12 ≥ D(R1 + R2) = σ_x² 2^{−2(R1+R2)}. Therefore, a natural question is whether this bound on D12 can be achieved with equality. However, the result in theorem 9.12 shows that this is not possible unless D1 = σ_x² or D2 = σ_x². Here is where the tension between the two descriptions manifests itself.

We examine the tension in the symmetric case, when we have D1 = D2 = D, R1 = R2 = R, and specialize to a unit variance source, σ_x² = 1. If we want the individual descriptions to be as efficient as possible (i.e., D = e^{−2R}), then we see that D12 ≥ D/(2 − D), which is far larger than D(R1 + R2) = e^{−2(R1+R2)} = D². For small D, we see that D12 is approximately D/2, which is much larger than D². Therefore, if we ask that the individual descriptions be close to optimal themselves, then they do not mutually refine each other very well. This reveals the tension between getting small distortions D1, D2 for the individual descriptions and a small D12. We need to make the individual descriptions coarser in order to get more mutual refinement in D12.

One important real-time application is video coding. Video can be viewed as a sequence of individual frames which are correlated with each other. The traditional way of encoding video is to describe the "current" frame differentially with respect to the previous frame.
This is done through a block-matching technique, where the "closest" (in terms of squared distance) blocks from the previous frame are matched to blocks in the current frame, and then only the differences are transmitted. The rationale behind this idea is that blocks are only relatively displaced due to motion of objects in the video, and hence this mechanism is called motion compensation in the literature (Reibman and Sun, 2000). Note that in this scheme, the encoder explicitly uses knowledge of the previous frame. Clearly, when there are packet/route errors and the previous frame is not received at the destination, reconstruction is difficult, since the previous reference frame is not available. Therefore, several fixes to this problem have been developed over the past two decades (see Girod and Farber, 2000, and references therein). In a more abstract framework, we can think of the video as a sequence of correlated random variables which we are trying to describe efficiently. In Witsenhausen and Wyner, an alternate approach was taken by considering the video


coding problem as a source coding problem with side information. In this setting, after the "previous" frame has been encoded and transmitted, an encoder is developed for the "current" frame which does not explicitly depend on knowledge of the previous frame. The basic idea of this scheme arises from the encoding and decoding schemes described in Slepian and Wolf (1973) and Wyner and Ziv (1976). Since the encoder does not explicitly use the side information (the previous frame), it can be designed such that the computational complexity is shifted from the encoder to the decoder. Such an architecture is attractive for applications where the encoder needs to be simple but the decoder can be more complex. This idea has been developed comprehensively in Puri and Ramchandran (2003), where practical coding techniques are developed with such applications in mind.
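The complexity shift behind Slepian-Wolf and Wyner-Ziv style coding can be seen in a tiny binning example (entirely our toy, not a construction from the references): the encoder sends only a syndrome, never looking at the side information, and the decoder combines the syndrome with what it knows.

```python
import itertools

# Toy syndrome (binning) sketch: X is 3 bits; the decoder's side
# information S differs from X in at most one position.  The encoder
# sends only the 2-bit syndrome of X with respect to the [3,1]
# repetition code {000, 111}; the decoder picks the member of the
# indicated coset closest to S.
H = [(1, 1, 0), (1, 0, 1)]  # parity-check matrix of the repetition code

def syndrome(x):
    return tuple(sum(h[i] * x[i] for i in range(3)) % 2 for h in H)

def decode(syn, side_info):
    coset = [x for x in itertools.product((0, 1), repeat=3) if syndrome(x) == syn]
    return min(coset, key=lambda x: sum(a != b for a, b in zip(x, side_info)))

# Exhaustive check: 2 bits always suffice in place of 3.
for x in itertools.product((0, 1), repeat=3):
    for flip in (None, 0, 1, 2):
        s = tuple(b ^ (i == flip) for i, b in enumerate(x))
        assert decode(syndrome(x), s) == x
print("all cases decoded correctly from a 2-bit syndrome")
```

Each coset contains two words at Hamming distance 3 from each other, so side information within distance 1 of X always singles out the correct one; the search over the coset is what moves complexity to the decoder.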

Figure 9.14  Multiple description source coding with side information. (The figure shows the encoder for the source {X(k)} sending indices i(1) ∈ I(1) and i(2) ∈ I(2) over routes 1 and 2; decoders 1, 2, and 12 form the reconstructions X̂1(k) = f1(s, i(1)), X̂2(k) = f2(s, i(2)), and X̂12(k) = f12(s, i(1), i(2)) using the side information {S(k)}, while a switch indicates whether the encoder also has access to the side information.)

However, even with this idea, the robustness to route failures which is inherent to MD coding is not captured. Motivated by this, Diggavi and Vaishampayan (2004) considered the MD problem with side information (see fig. 9.14). In this abstract setting, we want to encode a source {X(k)} when the decoder has knowledge of a correlated process {S(k)} as side information. For example, in the settings of Witsenhausen and Wyner and of Puri and Ramchandran (2003), the side information could be the previous frame. In order to describe the source in the presence of route diversity, we can pose an MD problem, but now with side information, as shown in fig. 9.14. Clearly this is a generalization of the MD problem, and an achievable rate region was established for this problem in Diggavi and Vaishampayan (2004).

Theorem 9.13  Let (X(1), S(1)), (X(2), S(2)), . . . be drawn i.i.d. ∼ Q(x, s). If only the decoder has access to the side information {S(k)}, then (R1, R2, D1, D2, D12) is achievable if there exist random variables (W1, W2, W12) with probability mass function p(x, s, w1, w2, w12) = Q(x, s) p(w1, w2, w12|x), that is, S ↔ X ↔ (W1, W2, W12)


form a Markov chain, such that

R1 > I(X; W1|S),  R2 > I(X; W2|S),
R1 + R2 > I(X; W12, W1, W2|S) + I(W1; W2|S),    (9.36)

and there exist reconstruction functions f1, f2, f12 which satisfy

D1 ≥ E[d1(X, f1(S, W1))],  D2 ≥ E[d2(X, f2(S, W2))],
D12 ≥ E[d12(X, f12(S, W12, W1, W2))].    (9.37)

This result gives an achievable rate region, but the complete characterization for this problem is open. A slightly improved region over theorem 9.13 is also given in Diggavi and Vaishampayan (2004). However, it is unknown whether this region exhausts the achievable rate region. But for the case when the source and the side information are jointly Gaussian, and we are interested in the squared error distortion, a complete characterization of the rate-distortion region was obtained in Diggavi and Vaishampayan (2004). In more detail, the result is the following. Let (X(1), S(1)), (X(2), S(2)), . . . be a sequence of i.i.d. jointly Gaussian random variables. With no loss of generality this can be represented by

S(k) = α [X(k) + U(k)],    (9.38)

where α > 0 and {X(k)}, {U(k)} are independent Gaussian random variables with E[X] = 0 = E[U], E[X²] = σ_X², E[U²] = σ_U². As considered in theorem 9.13, only the decoder has access to the side information {S(k)}. If the distortion measures are d_m(x, x̂_m) = |x − x̂_m|², m = 1, 2, 12, then it is shown in Diggavi and Vaishampayan (2004) that the set of all achievable tuples (R1, R2, D1, D2, D12) is given by

D1 > σ_F² e^{−2R1},  D2 > σ_F² e^{−2R2},
D12 > σ_F² e^{−2(R1+R2)} / [1 − (√Π̃ − √Δ̃)²],    (9.39)

where σ_F² = σ_X² σ_U² / (σ_X² + σ_U²) and Π̃, Δ̃ are given by

Π̃ = (1 − D1/σ_F²)(1 − D2/σ_F²),  Δ̃ = D1 D2/σ_F⁴ − e^{−2(R1+R2)}.    (9.40)
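These closed-form expressions can be evaluated directly. The sketch below (ours; rates in nats, with illustrative variances) computes σ_F² and the three bounds of equation 9.39 when each description operates at its individual optimum, so that Δ̃ = 0.

```python
import math

# Evaluate the side-information MD bounds (9.39)-(9.40) for a sample
# jointly Gaussian pair; the variances below are illustrative choices.
sigma_X2, sigma_U2 = 1.0, 0.25
sigma_F2 = sigma_X2 * sigma_U2 / (sigma_X2 + sigma_U2)  # conditional variance of X given S

def md_bounds(R1, R2):
    D1 = sigma_F2 * math.exp(-2.0 * R1)  # individually optimal descriptions
    D2 = sigma_F2 * math.exp(-2.0 * R2)
    pi = (1.0 - D1 / sigma_F2) * (1.0 - D2 / sigma_F2)
    delta = D1 * D2 / sigma_F2**2 - math.exp(-2.0 * (R1 + R2))  # = 0 here
    D12 = sigma_F2 * math.exp(-2.0 * (R1 + R2)) / (
        1.0 - (math.sqrt(pi) - math.sqrt(max(delta, 0.0))) ** 2)
    return D1, D2, D12

D1, D2, D12 = md_bounds(1.0, 1.5)
print(f"sigma_F^2 = {sigma_F2:.3f}; D1 > {D1:.4f}, D2 > {D2:.4f}, D12 > {D12:.4f}")
```

Note that the side information shrinks the effective source variance from σ_X² to σ_F², and that the joint bound D12 improves on both individual bounds, just as in the case without side information.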

The result in equation 9.39 also shows that the rate-distortion region in this case is the same as that achieved when both encoder and decoder have access to the side information. That is, in the Gaussian case, the rates that can be achieved are the same whether the switch in ﬁg. 9.14 is open or closed. In Wyner and Ziv (1976) it was shown that in the single-description Gaussian case, the decoder-only side information rate-distortion function coincided with that when both encoder and decoder were informed of the side information. The result (eq. 9.39) establishes that this is also true in the Gaussian two-description problem with decoder side information. However, the encoding and decoding techniques to achieve these rate tuples are very diﬀerent when the encoder has access to the side information than


when it does not. This shows that there might be efficient mechanisms to construct MD video coders which are robust to route failures. Some of the code constructions that bring this idea to fruition are discussed in section 9.5.2.

9.5.2  Quantizers for Route Diversity

The results given in section 9.5.1 show the existence of codes that can achieve the rate tuples given in theorems 9.12 and 9.13, but they provide no explicit constructions. In this section we explore explicit coding schemes which utilize the presence of route diversity. As seen in section 9.5.1, the single-description rate-distortion function quantifies the fundamental limits of the trade-off between the rate of the representation of a source and its average fidelity. The result in equation 9.31 showed the existence of such codes. Explicit constructions of these codes are called quantizers (Gersho and Gray, 1992; Gray and Neuhoff, 1998). More formally, quantizers map a sequence {X(1), . . . , X(T)} of source samples into a "representative" reconstruction {X̂(1), . . . , X̂(T)} through an explicit mapping which is typically computationally efficient. Scalar quantizers operate on a single source sample X(k) at a time. Most current systems use scalar quantizers (Jayant and Noll, 1984). However, rate-distortion theory tells us that using sequences is important, and hence vector quantizers quantize sequences of source samples, i.e., use T > 1. Quantization techniques for single descriptions have been well studied and are well understood (Gersho and Gray, 1992; Gray and Neuhoff, 1998; Jayant and Noll, 1984).

The rudiments of the MD coding ideas arose in the 1970s at Bell Laboratories. Jayant (1981) proposed and analyzed a very simple idea of channel splitting. The basic idea was to oversample a speech signal and send the odd samples through one channel and the even ones through another. However, this technique is not very efficient in terms of rate. Many such simple coding techniques were considered at Bell Laboratories, but the ideas were not archived. These questions actually motivated the information-theoretic formulation of the MD problem described in section 9.5.1. The systematic study of coding for multiple descriptions was initiated in Vaishampayan (1993).
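Jayant's channel-splitting idea is simple enough to sketch directly (our toy illustration, with lossless sample transport and linear interpolation at the decoder; a real system would also quantize each description).

```python
import math

# Toy sketch of Jayant-style channel splitting: even-indexed samples on
# route 1, odd-indexed samples on route 2.  If a route fails, the lost
# samples are interpolated from their received neighbors.
signal = [math.sin(2.0 * math.pi * k / 40.0) for k in range(200)]  # smooth test signal
even, odd = signal[0::2], signal[1::2]  # the two descriptions

def from_even_only(even, length):
    out = [0.0] * length
    out[0::2] = even
    for i in range(1, length, 2):  # interpolate the missing odd samples
        out[i] = (out[i - 1] + out[i + 1]) / 2.0 if i + 1 < length else out[i - 1]
    return out

both = [0.0] * len(signal)
both[0::2], both[1::2] = even, odd            # both routes arrive: exact
one = from_even_only(even, len(signal))       # route 2 fails: interpolate
mse = sum((a - b) ** 2 for a, b in zip(signal, one)) / len(signal)
print(f"both routes: exact reconstruction; one route: MSE = {mse:.2e}")
```

Either description alone yields a usable (if coarser) signal, which is exactly the MD property; the rate inefficiency comes from the oversampling needed to make each half-stream self-sufficient.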
Its publication resulted in a spurt of activity on the topic (see, for example, Diggavi et al., 2002c; Goyal and Kovacevic, 2001; Ingle and Vaishampayan, 1995; Vaishampayan et al., 2001, and references therein). More recently, the utility of MD coding in conjunction with route diversity has also created interest in the networking community (see Apostolopoulos and Trott, 2004, and references therein). The basic idea introduced in Vaishampayan (1993) was to construct scalar quantizers for the MD problem. This was done specifically for the symmetric case, where D1 = D2 and R1 = R2. This symmetric construction was extended to structured (lattice) vector quantizers in Vaishampayan et al. (2001). The symmetric case has been further explored by several other researchers (Goyal and Kovacevic, 2001; Ingle and Vaishampayan, 1995). The importance of structured quantizers lies in the computational complexity of the source encoder. For example, just as in channel


Figure 9.15  Structure of the multiple description quantizer. (The figure shows the source first quantized to a point λ on a lattice Λ, followed by a multiple description labeling that maps λ to a pair (λ1, λ2) on coarser lattices Λ1 and Λ2.)

coding, trellis-based structures are also important in source coding. Such structures have also been proposed for the symmetric MD problem (Buzi, 1994; Jafarkhani and Tarokh, 1999). In general, unstructured quantizers based on training on source samples can also be constructed, but the computational complexity of such techniques is much higher than that of structured (lattice) quantizers, and therefore they are less attractive in practice. Such unstructured quantizers have been considered in the literature (Fleming et al., 2004). Our focus in this chapter is on structured quantizers, for which we have computationally efficient encoders as well as techniques to analyze their performance.

In general we would like to design MD quantizers that can attain an arbitrary rate-distortion tuple, and not just the symmetric case. This is motivated by applications where the multiple routes have disparate capacities (and therefore rate requirements) as well as different probabilities of route failure. In these cases, we need to design asymmetric MD quantizers which give graceful degradation in performance with route failures. Such a structure was studied in Diggavi et al. (2002c), and is depicted in fig. 9.15.

We illustrate the ideas of MD quantizer design from Diggavi et al. (2002c) using a scalar example. In fig. 9.16, the first line represents a uniform scalar quantizer. If we take a single source sample X(k) ∈ ℝ, then the uniform quantizer maps this sample to the closest "representative" point X̂ on the one-dimensional (scaled integer) lattice Λ. Loosely, a T-dimensional lattice is a set of regularly spaced points in ℝ^T for which any point can be chosen as the origin and the set of points would be the same. A more precise notion is based on the set of points forming an additive group (Conway and Sloane, 1999). Each of the representative points is given a unique label λ, and this label is transmitted to the receiver. The transmission rate depends on the number of labels.
Typically a finite set of 2M points is used to represent the labels. In a straightforward manner, this translates to a rate of log(2M) bits per source sample. If the source either has finite extent or a finite second-order moment, such a quantizer has a bounded squared error distortion. If the representative points are separated by a distance of Δ, then the worst-case squared error distortion between a source sample and its representative is Δ²/4 for source samples X(k) ∈ [−(M+1)Δ/2, (M+1)Δ/2]. For a uniform distribution of the source over the region X(k) ∈ [−(M+1)Δ/2, (M+1)Δ/2], the average distortion is Δ²/12


Figure 9.16  Scalar quantizer labeling example. The top line is a uniform scalar quantizer that maps source points X(k) to a set of discrete representatives X̂. The second and third lines show coarser uniform scalar quantizers. The last line puts together the combination of the coarser quantizers to give an ordered pair (λ1, λ2) as a label to every lattice point λ in the fine quantizer.

(Gersho and Gray, 1992). The mapping described above is a single-description uniform scalar quantizer.

The MD scalar quantizer needs to map every source sample to an ordered pair of representation points (X̂1, X̂2). The labels (λ1, λ2) of this pair are used to send information over the two routes. For example, we could send the label λ1 over the first route and the label λ2 over the second route. In fig. 9.16 we have illustrated this by choosing coarser scalar quantizers, in the second and third lines, for the representations X̂1 and X̂2 respectively. These quantizers are also one-dimensional lattices, Λ1 and Λ2 respectively. The representations X̂1, X̂2 in themselves give coarser information about the source sample, i.e., have a larger distortion than the "finer" quantizer Λ shown in the first line. Now, we need to represent the source sample X(k) by a pair of representation points from Λ1 and Λ2. We want to choose this pair in such a way that if either of the labels is lost due to route failure, then we are still guaranteed a certain distortion. However, if both labels are received, i.e., both routes are successful, then we need to get a smaller distortion. This means that the labels of the pair have to mutually refine each other's representations.

One such labeling technique is illustrated in fig. 9.16. Each point X̂1 in the coarser lattice Λ1 and each point X̂2 in Λ2 is given a label λ1 and λ2 respectively. The idea is then to give a pair of labels (λ1, λ2) to each of the points on the fine lattice Λ. Every lattice point in Λ gets a unique label pair (λ1, λ2). Once this labeling function is constructed, we can form the multiple description (MD) scalar quantizer in two steps. First, reduce the source sample X(k) ∈ ℝ to its


Diversity in Communication: From Source Coding to Wireless Networks

closest representative in Λ, X̂ with label λ, i.e., apply a uniform scalar quantizer to X(k). Given this X̂ and the labeling function, we know the pair (λ1, λ2) that represents X̂. The second step is to associate X̂1 with the reconstruction given by the label λ1 in the first coarse quantizer Λ1, and similarly for X̂2 in Λ2. These operations are what the structure in fig. 9.15 represents. Therefore, in this design, the main task is to construct the labeling function for each point in Λ. Given the label pair (λ1, λ2), the encoder sends the index associated with λ1 on route 1 and the index for λ2 on route 2. Before describing the labeling function, we examine the decoder structure in the MD scalar quantizer described above. First recall that the labeling function is designed so that any particular pair (λ1, λ2) is uniquely associated with a particular λ. Therefore, if both routes succeed, then the receiver is able to reconstruct λ and as a consequence X̂. This means that the distortion in this case is that associated with the fine quantizer Λ, i.e., the average fidelity is Δ²/12. Now suppose route 1 succeeds and route 2 fails; then the receiver has only λ1 and does not know λ2. For example, suppose in fig. 9.16 the label pair (−1, 0) was chosen at the source encoder, i.e., λ1 = −1, λ2 = 0. Now, the receiver knows that the encoder was trying to send one of the two points labeled (−1, 0) or (−1, −1), and since route two failed it does not know which. More generally, in this situation, the receiver knows that λ belongs to the set of points in Λ which have the same first label λ1 but a different second label λ2. Now, assume that the decoder uses the reconstruction X̂1 associated with label λ1 = −1 in Λ1 (see second line in fig. 9.16). Therefore, for this particular example, the worst-case error due to this choice is (9/4)Δ². This example also shows that the labeling function directly affects the decoder distortion.
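The decoder behavior just described can be sketched as follows. The tiny labeling table and coarse reconstructions below are illustrative stand-ins for the fig. 9.16 design (only a few of its labels are stated in the text), not the book's actual labeling function.

```python
def md_decode(l1, l2, labeling, coarse1, coarse2):
    """MD scalar-quantizer decoder sketch.

    labeling: {fine_lattice_point: (l1, l2)} -- the designed labeling function
    coarse1, coarse2: label -> reconstruction point in the coarse lattices
    l1 or l2 is None when the corresponding route failed.
    """
    if l1 is not None and l2 is not None:
        # Both routes succeeded: the label pair identifies a unique fine point.
        inverse = {pair: lam for lam, pair in labeling.items()}
        return inverse[(l1, l2)]
    if l1 is not None:
        # Route 2 failed: fall back to the coarse Λ1 reconstruction for l1
        # (averaging all fine points sharing l1 can only do better).
        return coarse1[l1]
    return coarse2[l2]

# Hypothetical toy labeling (positions in units of the fine step Δ):
labeling = {0: (0, 0), -1: (-1, 0), -2: (-1, -1)}
coarse1 = {0: 0, -1: -2}   # Λ1 = every 2nd point: label -> position
coarse2 = {0: 0, -1: -3}   # Λ2 = every 3rd point: label -> position

md_decode(-1, 0, labeling, coarse1, coarse2)     # both routes: exact fine point
md_decode(-1, None, labeling, coarse1, coarse2)  # route 2 failed: Λ1 fallback
```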
The design of the labeling function is the central part of the MD quantizer. The reconstruction X̂1 can use the mean of the set of all points in Λ associated with the same first label λ1, which may improve the distortion. Note that in general this might not coincide with the reconstruction associated with λ1. For design simplicity this reconstruction need not be taken into account in designing the labeling function, but rather can be used only at the decoder to improve the final distortion. In general, we would need to construct a labeling function for all the points in Λ. However, we describe a particular design which solves a smaller problem and then expands its solution to Λ (Diggavi et al., 2002c; Vaishampayan et al., 2001). We will illustrate this idea using the example shown in fig. 9.16. In the last line of fig. 9.16, we have depicted the overlay of the two coarse one-dimensional lattices Λ1, Λ2 along with Λ. We see that there is a repetitive pattern after every six points in Λ. This is not a coincidence, because Λ1 was formed by taking every second point in Λ and Λ2 by taking every third. The least common multiple is 6 and therefore we would expect the pattern to repeat. The basic idea is to form a labeling function for just these six points and then "shift" these labels to tile the entire lattice Λ. For example, in fig. 9.16, consider the point which we have labeled as (2, 2) on the last line. This was done in the following manner. Notice that the repeating pattern of six points can be anchored by the points where both the Λ1 and Λ2 points coincide. In fig. 9.16, these are the points which have overlapped


circles on the last line. We can think of all points in Λ with respect to these anchor points. For example, the point labeled (2, 2) is one point to the left of such an overlap point and is "equivalent" to the point labeled (−1, 0). More precisely, it is in the same coset as the other point with respect to the "intersection" lattice Λs, which is formed by the anchor points. Therefore, we get the label by shifting the label of (−1, 0) with respect to its cosets. In this case, note that λ1 = −1 in Λ1 is two points to the left of the anchor point (0, 0). Therefore, the corresponding point with respect to the anchor point (3, 2) is λ1 = 2, and hence the first label for the point of interest is λ1 = 2. Next, the corresponding point of the label λ2 = 0 in Λ2 with respect to the anchor point (3, 2) is λ2 = 2. This gives us the label (2, 2) which is shown in fig. 9.16. In a similar manner, given the labeling for the six points, we can construct the labeling for all points in Λ by the shifting technique described above. Actually, the six points correspond to the discrete Voronoi region of the point (0, 0) of the intersection lattice of the anchor points. Therefore, we can focus on constructing labels for the points in the Voronoi region of the intersection lattice. Note that in the example of fig. 9.16, the intersection lattice had an index of six, which is exactly the least common multiple of the indices of lattices Λ1, Λ2 in Λ. This is also true when the indices of Λ1, Λ2 in Λ are not coprime (Diggavi et al., 2002c). Let V_{Λs:Λ}(0) be defined as the Voronoi region of the intersection lattice. Our problem is to develop the labeling function for the points in V_{Λs:Λ}(0) in order to satisfy the individual distortion constraints D1, D2. This is accomplished by using a Lagrangian formulation in Diggavi et al. (2002c). This formulation reduces to finding the labeling scheme α(λ) = (α1(λ), α2(λ)) so as to minimize

    Σ_{λ ∈ V_{Λs:Λ}(0)} [ γ1 ||λ − α1(λ)||² + γ2 ||λ − α2(λ)||² ].    (9.41)
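The coset-shift step can be made concrete. The function below hard-codes the fig. 9.16 geometry (fine lattice Λ = Z, Λ1 every second point, Λ2 every third, anchors every six points); it is only a sketch of the general sublattice-shift rule, not a general implementation.

```python
def shift_label(point, label, shift):
    """Shift a known (point, (l1, l2)) pair by a multiple of the anchor
    spacing 6: six steps in the fine lattice advance the Λ1 index by 3
    and the Λ2 index by 2."""
    assert shift % 6 == 0, "shifts must stay within the same coset"
    q = shift // 6
    l1, l2 = label
    return point + shift, (l1 + 3 * q, l2 + 2 * q)

# The point one step left of the anchor (0, 0) carries label (-1, 0);
# shifting by one anchor period gives the point one step left of the
# anchor (3, 2), which fig. 9.16 labels (2, 2).
shift_label(-1, (-1, 0), 6)   # -> (5, (2, 2))
```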

For this minimization problem we need to choose the appropriate labels (α1(λ), α2(λ)) = (λ1, λ2). This is done by observing the following identity:

    γ1 ||λ − λ1||² + γ2 ||λ − λ2||² = (γ1 γ2 / (γ1 + γ2)) ||λ2 − λ1||² + (γ1 + γ2) ||λ − (γ1 λ1 + γ2 λ2)/(γ1 + γ2)||².

This results in the following design guideline. The labeling problem is split into two parts: (1) choose |V_{Λs:Λ}(0)| "shortest" pairs (λ1, λ2) (not all pairs (λ1, λ2) are used); (2) assign these pairs to the lattice points λ ∈ V_{Λs:Λ}(0). The second step can be solved very efficiently using linear programming methods. The solution of this labeling problem illustrates an important feature of the MD quantizer design that is quite distinct from the single-description case. It can happen that the labels of each description are noncontiguous, i.e., not all points λ which get the same label—say, λ1—need occur contiguously. This is quite different from the single-description case, where the labels are assigned to contiguous intervals. Also, the labels generated in this systematic manner are nontrivial and difficult to handcraft.
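The two-part guideline can be sketched by brute force for the one-dimensional example. Everything specific below (the Lagrangian weights, the choice of Voronoi points, the candidate ranges, and exhaustive search standing in for the linear program) is an illustrative assumption, not the book's design.

```python
from itertools import permutations

gamma1, gamma2 = 1.0, 1.0
voronoi = [-2, -1, 0, 1, 2, 3]   # one choice of the six points of V(0)

def pos1(l1):
    return 2 * l1                # Λ1 index l1 sits at fine position 2*l1

def pos2(l2):
    return 3 * l2                # Λ2 index l2 sits at fine position 3*l2

# Part 1: keep the |V| "shortest" pairs, ranked by the first term of the
# identity, (γ1 γ2 / (γ1 + γ2)) * (pos2(l2) - pos1(l1))**2.
candidates = [(l1, l2) for l1 in range(-3, 4) for l2 in range(-2, 3)]
candidates.sort(key=lambda p: (pos2(p[1]) - pos1(p[0])) ** 2)
pairs = candidates[:len(voronoi)]

# Part 2: assign the chosen pairs to the lattice points so as to minimize
# the Lagrangian cost (9.41); brute force replaces the linear program.
def cost(assignment):
    return sum(gamma1 * (lam - pos1(l1)) ** 2 + gamma2 * (lam - pos2(l2)) ** 2
               for lam, (l1, l2) in zip(voronoi, assignment))

best = min(permutations(pairs), key=cost)
labeling = dict(zip(voronoi, best))   # fine point -> (λ1, λ2)
```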


Figure 9.17 Labels for a two-dimensional integer lattice example.

The labeling scheme described for the scalar quantizer actually illustrates a more general principle which is applicable to MD vector quantizers (Diggavi et al., 2002c). We use a chain of lattices as illustrated in fig. 9.15, i.e., we use a fine lattice Λ and two coarser sublattices Λ1, Λ2. These lattices have an intersection lattice Λ_lcm, one of whose Voronoi regions is what we label. The idea of using sublattice shifts as done above to generate the labels using only the labels of this Voronoi region can also be generalized (Diggavi et al., 2002c). One such example of the labels of the Voronoi region for a two-dimensional lattice is shown in fig. 9.17. Therefore, the vector quantizer proceeds as follows. We first reduce the point X^T ∈ R^T to its closest point in a fine lattice Λ, and then using the labeling function we find (λ1, λ2). Then as before λ1 is sent over the first route and λ2 is sent over the second route. The decoder also proceeds in a manner similar to the scalar quantizer described above. As seen above, the crux of the MD quantizer design problem is to construct the appropriate labeling function. In Diggavi et al. (2002c), it is shown that an appropriate labeling function, along the lines described for the scalar quantizer, can be constructed very efficiently using a linear program. In fact, Diggavi et al. (2002c) shows that such a labeling scheme is very close to being optimal in terms of the rate distortion result given in theorem 9.12 in the high-rate regime.
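The front end of the vector quantizer—reducing X^T to its closest fine-lattice point—is, for the integer lattice Z^T, just component-wise rounding. This is a minimal sketch; practical designs use structured lattices with their own fast nearest-point algorithms.

```python
import numpy as np

def quantize_to_integer_lattice(x):
    """Map a source vector in R^T to the nearest point of the lattice Z^T."""
    return np.rint(np.asarray(x, dtype=float)).astype(int)

quantize_to_integer_lattice([0.4, -1.6])   # -> array([ 0, -2])
```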


9.5.3

Network Protocols for Route Diversity

In order to utilize route diversity in a network, one of the most important components is clearly the design of the MD source coding techniques studied in section 9.5.2. However, an equally important question is the design of routing techniques that can enable the use of MD source coding. In this section we briefly examine these issues from a networking point of view.

In order to create route diversity, we need multiple routes which are disjoint, in that they do not share common links. This can be done through IP source routing (Keshav, 1997). Source routing is a technique whereby the sender of a packet can specify the route that the packet should take through the network. In the typical IP routing protocol, each router chooses the next hop for a packet by examining the destination IP address. In source routing, however, the "source" (i.e., the sender) makes some or all of these decisions. In strict source routing (which is virtually never used), the sender specifies the exact route the packet must take. The more common form is loose source record route (LSRR), in which the sender gives one or more hops that the packet must go through. Therefore, the sender can take an MD code and send each of the descriptions over a different route by explicitly specifying the routes using the IP source routing protocol. An alternate technique is to use an overlay network, in which an application collects the different descriptions and sends them through different relay nodes in order to create route diversity.

This discussion shows that creating route diversity is architecturally not difficult, even using the provisions within IP (Keshav, 1997). This discussion from a networking point of view also exposes the inherent interactions required between the routing and application layers of the networking protocol stack.
Such “interlayer” interactions become particularly important in wireless networks, where route failures could occur more frequently than in wired networks. Therefore, in this case diversity, albeit at a much higher layer in the IP stack, again becomes quite important.

9.6

Discussion

In this chapter we studied the emerging role of diversity with respect to three disparate topics. The idea of using multiple instantiations of randomness attempts to turn the presence of randomness to an advantage. For example, in multiple-antenna diversity, the degrees of freedom provided by space diversity are utilized for increased rate or reliability. In mobile ad hoc networks, the random mobility is utilized to route information from source to destination. To realize the benefits promised by the use of diversity, we need to have interactions across networking layers. For example, in opportunistic scheduling (studied in section 9.4.1) the transmission rates that can be supported by the physical layer interact with the resource allocation (scheduling), which is normally only a functionality of the data-link layer. In the multi-user diversity studied in


mobile ad hoc networks (see section 9.4.2), the routing of the packets interacted with the physical-layer transmission. Finally, the MD source coding studied in section 9.5 necessitated an interaction between source coding (an application-layer functionality) and routing. These examples of cross-layer protocols are becoming increasingly important in reliable network communication. Diversity is the common thread among several of these cross-layer protocols. The advantages of using diversity in these contexts are just beginning to be realized in practice. There might be many more areas where the ideas of using diversity could have an impact, and this is a topic of ongoing research.

Notes

1. The term sufficient statistic refers to a function (perhaps many-to-one) which does not cause loss of information about the random quantity of interest.
2. To be precise, we need to sample equation 9.1 at a rate larger than 2(W_I + W_s), where W_I is the input bandwidth and W_s is the bandwidth of the channel time variation (Kailath, 1961).
3. In passband communication, a complex signal arises due to in-phase and quadrature phase modulation of the carrier signal; see Proakis (1995).
4. This can be seen by noticing that for M_t = 1, a sufficient statistic is an equivalent scalar channel, ỹ(b) = h(b)* y(b) = ||h(b)||² x(b) + h(b)* z(b). In this chapter, |h|² = h h̄, where h̄ denotes complex conjugation, and for a vector h we denote its 2-norm by ||h||² = h* h, where h* denotes the Hermitian transpose and h^t denotes ordinary transpose.
5. The assumption that {H(b)} is i.i.d. is not crucial. This result is (asymptotically) correct even when the sequence {H(b)} is a mean ergodic sequence (Ozarow et al., 1994). We use the notation H to denote the channel matrix H(b) for a generic block b.
6. For a matrix A, we denote its determinant as det(A) and |A|, interchangeably.
7. In Foschini (1996), a similar expression was derived without illustrating the converse needed to establish that the expression was indeed the capacity.
8. Here the notation o(1) indicates a term that goes to zero as SNR → ∞.
9. For an information rate of R bits per transmission and a block length of T, we define the codebook as the set of 2^{TR} codeword sequences of length T.
10. A constellation size refers to the alphabet size of each transmitted symbol. For example, a QPSK modulated transmission has a constellation size of 4.
11. We use the notation f(n) = Θ(g(n)) to denote f(n) = O(g(n)) as well as g(n) = O(f(n)). Here f(n) = O(g(n)) means lim sup_{n→∞} |f(n)/g(n)| < ∞.
12. Interestingly, this result is specifically for two descriptions and does not immediately extend to the general case.

10

Designing Patterns for Easy Recognition: Information Transmission with Low-Density Parity-Check Codes

Frank R. Kschischang and Masoud Ardakani

10.1

Introduction

Coding for information transmission over a communication channel may be defined as the art of designing a (large) set of codewords such that (i) any codeword can be selected for transmission over the channel, and (ii) the corresponding channel output with very high probability identifies the transmitted codeword. Low-density parity-check codes represent the current state of the art in channel coding. They are a family of codes with flexible code parameters, and a code structure that can be fine-tuned so that decoding can occur at transmission rates approaching the information-theoretic limits established by Claude Shannon, yet with "practical" decoding complexity. In this chapter—which is aimed at the non-expert—we show that these codes are easy to describe using probabilistic graphical models, and that their simplest decoding algorithms (the "sum-product" or "belief-propagation" algorithm, and variations thereof) can be understood as message-passing in the graphical model. We show that a simple Gaussian approximation of the messages passed in the decoder leads to a tractable code-optimization problem, and that solving this optimization problem results in codes whose performance appears to approach the Shannon limit, at least for some channels.

Communication channels are typically modeled (with no essential loss of generality) in discrete time. At each unit of time a channel accepts (from the transmitter) a "channel input symbol" and produces (for the receiver) a corresponding "channel output symbol" according to some probabilistic channel model. Information is usually transmitted by using the channel many times, i.e., by transmitting many channel input symbols. In so-called "block coding," messages are mapped to sequences (x1, . . . , xn) of channel inputs of a fixed block-length n. A code is a set of "valid codewords" agreed upon by the transmitter and receiver prior to communication.
A code is typically a (carefully selected) subset of the set of all possible channel inputs of length n. Transmission of a codeword gives rise (at the receiver)


to a "received word," an n-tuple (y1, . . . , yn) of channel output symbols. The task of the receiver is to infer from the received word which codeword—and hence which message—was (ideally, most likely) transmitted. Alternatively, if the message consists of many symbols, the receiver may wish to infer the most likely value of each message symbol. By attempting to determine which codeword was most likely transmitted, a decoding algorithm attempts to solve a noisy pattern-recognition problem. There are, therefore, some similarities between the fields of coding theory and pattern recognition. However, a key difference that makes the two fields quite distinct is the fact that the set of valid codewords is under the control of the system designer in the former case, but not (usually) in the latter case. In other words, in coding theory the system designer is given the luxury of choosing the set of patterns to be recognized by the decoding algorithm. A major theme in coding theory research is, therefore, to optimize or fine-tune the structure of the code for effective recognition (decoding) by some particular class of decoding algorithms. Another major difference between coding theory and typical pattern-recognition problems is the sheer number of patterns to be recognized by the decoding algorithm. In many pattern-recognition problems, the number of different patterns (or pattern classes) is relatively small. In coding theory, the numbers can be extraordinarily large. The transmission of k bits corresponds to the selection, by the transmitter, of one codeword from a code of 2^k possible codewords. Thus, transmission of a single bit requires a code of just two codewords; transmission of two bits requires a code of four codewords, and so on. Typical values of k for codes used in practice range from less than a dozen bits to tens of thousands of bits, and hence the number of different codewords to be "recognized" can be as large as 2^{10,000} or more!
Despite these huge numbers, decoding algorithms routinely make rapid decoding decisions, reliably producing decoded information at many megabits per second. A key parameter of a code is its rate. A code of 2^k codewords, each having block-length n, is said to have a rate of R = k/n bits/symbol (or bits per channel use). Clearly the rate of a code is a measure of the "speed" at which information is transmitted, normalized per channel use. To convert from bits per symbol to bits per second, one needs to know the number of symbols that may be transmitted per second, a value that typically scales linearly with the "channel bandwidth." Channel bandwidths can vary greatly, depending on the application; thus, from the point of view of code design, it is more appropriate to focus on code rate measured in bits/symbol (rather than bits/s). Given a particular channel, one would clearly like to make the code rate as large as possible. On the other hand, one also desires to make reliable decoding decisions, i.e., decisions for which the probability of error approaches zero. At first glance, it may seem that there should be a trade-off between transmission rate and reliability, i.e., for a fixed k, intuition would suggest that for some sufficiently large n it should be possible to design a code so that the probability of decoding error can be made smaller than any chosen ε > 0.


This would certainly be true for the transmission of k = 1 bit over a binary symmetric channel with "crossover probability" p < 1/2. Such a channel accepts xi ∈ {0, 1} at its input and produces a corresponding yi ∈ {0, 1} at its output. Each transmitted symbol is independently "flipped" with probability p, i.e., with probability p we have yi ≠ xi. A single bit can be transmitted with a repetition code of two codewords {000···0, 111···1}, where both codewords have the same length n. This code can be decoded according to a "majority rule": if the majority of the symbols in the received word are zero, then decode to the all-zero codeword; otherwise decode to the all-one codeword. It is easy to see that if n is made sufficiently large, then the probability of error under the majority rule can be made arbitrarily small for any p < 1/2. In this example, there is a smooth trade-off between rate and reliability; as the rate decreases, the reliability increases, and one might be inclined to believe that such is the trade-off in general. In fact, although there is a fundamental trade-off between code rate and reliability, the trade-off is abrupt (like a step function), not smooth. In his seminal 1948 paper (Shannon, 1948), Claude E. Shannon established the remarkable fact that typical communication channels are characterized by a so-called channel capacity, C, with the property that reliable communication (i.e., communication with probability of error approaching zero) is possible for every R < C. More precisely, Shannon showed that for every R < C and every ε > 0, by choosing a sufficiently large block-length n, there exists a block code of length n with at least 2^{nR} codewords and a decoding algorithm for this code that yields a probability of decoding error smaller than ε. Conversely, one may also show that if R > C, then, even using an algorithm that minimizes error probability, it is impossible to achieve arbitrarily small probability of error.
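The repetition-code example above is easy to simulate. The sketch below (trial count and seed are arbitrary choices) shows the majority rule's error probability falling as n grows, for p < 1/2, while the rate falls as 1/n.

```python
import random

def majority_error_rate(n, p, trials=2000, seed=1):
    """Estimate the majority-rule error probability of the length-n
    repetition code on a BSC with crossover probability p."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        # By symmetry we may transmit the all-zero codeword; each of the
        # n symbols is independently flipped with probability p.
        flips = sum(rng.random() < p for _ in range(n))
        if flips > n // 2:       # majority of received symbols are ones
            errors += 1
    return errors / trials

# Longer repetition codes are more reliable: majority_error_rate(31, 0.2)
# is far smaller than majority_error_rate(3, 0.2).
```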
Thus, to achieve arbitrarily good reliability, it is not necessary to have R → 0, but rather only to have R < C. Information theory allows us to reﬁne our stated goal of coding for information transmission over a communication channel with capacity C. The goal is to design a code such that (i) any codeword can be selected for transmission over the channel, (ii) the corresponding channel output can be processed by an algorithm of “practical” complexity to identify (with very high probability) the transmitted codeword, and (iii) the rate of the code is “close” to C. In the remainder of this chapter, we will show how low-density parity-check (LDPC) codes achieve this goal.

10.2

A Brief Introduction to Coding Theory

Low-density parity-check codes are binary linear block codes (though it is possible to define non-binary versions as well). Accordingly, we begin by defining what this means. Let F2 denote the finite field of two elements {0, 1}, closed under modulo-two integer addition and multiplication. This field has the simplest possible arithmetic: for all x ∈ F2, under addition we have 0 + x = x and 1 + x = x̄, where x̄ denotes the complement of x (thus a two-input F2-adder is an exclusive-OR gate), and under

290

Designing Patterns for Easy Recognition

multiplication we have 0 · x = 0 and 1 · x = x (thus a two-input F2-multiplier is an AND gate). For every positive integer n, we let F2^n denote the set of n-tuples with components from F2, which forms a vector space over F2 equipped with the usual component-wise vector addition and with multiplication by scalars from F2. As is the convention in coding theory, we will always think of such vectors as row vectors. By definition, a binary linear block code of block-length n and dimension k is a k-dimensional subspace of F2^n. It follows that a binary linear block code is itself closed under vector addition and multiplication by scalars; in particular, the sum of two codewords is another codeword, and the code certainly always contains the all-zero vector (0, 0, . . . , 0). (Although it is certainly possible to define nonlinear codes, i.e., codes that are general subsets of F2^n, not necessarily subspaces, most codes used in practice are linear codes.) A binary linear code of length n and dimension k will be denoted as an [n, k] code. Such a code has 2^k codewords, and hence has rate R = k/n. We will only ever consider such codes with k > 0. From now on, when we write "code," we mean binary linear block code.

How can an [n, k] code be specified? One way is to observe that such a code C is a k-dimensional vector space, and hence it has a basis (in general, many bases), i.e., a set {v1, v2, . . . , vk} of linearly independent vectors that span C. These vectors can be collected together as the rows of a k × n matrix G, called a generator matrix for C, with the evident property that C is the row space of G. For every distinct u ∈ F2^k we obtain a distinct codeword v = uG ∈ C. Hence, a generator matrix yields a way to implement an encoder for C, by simply mapping a message u ∈ F2^k under matrix multiplication by G to the codeword v = uG ∈ C. Thus a code C may be specified by providing a generator matrix for C.
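Encoding with a generator matrix is a single mod-2 matrix multiplication. The small G below is an illustrative [5, 3] example of our own choosing, not a code from the text.

```python
import numpy as np

G = np.array([[1, 0, 0, 1, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1]], dtype=int)   # generator of a [5, 3] code

def encode(u):
    """Map a message u in F_2^k to the codeword v = u G (arithmetic mod 2)."""
    return (np.asarray(u, dtype=int) @ G) % 2

# Enumerating all 2^k messages yields 2^k = 8 distinct codewords,
# closed under component-wise mod-2 addition.
codewords = {tuple(encode(u)) for u in np.ndindex(2, 2, 2)}
```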
Another way to specify an [n, k] code C—and the one we will use to define low-density parity-check codes—is to view C as the solution space of some homogeneous system of linear equations in n variables X1, X2, . . . , Xn. Since, in F2, there are only two possible scalars, the structure of any one such equation is exceedingly simple: it is always of the form

    Xi1 + Xi2 + · · · + Xim = 0,    (10.1)

where {i1, i2, . . . , im} is some subset of {1, 2, . . . , n}. Such an equation is sometimes referred to as a parity check, since it specifies that the "parity" (the number of ones) in the subset of the variables indexed by {i1, . . . , im} should be even, i.e., (10.1) is satisfied if and only if an even number of Xi1, Xi2, . . . , Xim take value one. If we define the n-tuple h as the vector with value one in components i1, i2, . . . , im, and value zero in all other components, then (10.1) may also be written as (X1, . . . , Xn) h^T = 0, where h^T denotes the transpose of h. Given a system of (n − k) parity-check equations, we may collect the corresponding h-vectors to form the rows of an (n − k) × n matrix H, called a parity-check matrix. The set

    C = {x : x H^T = 0}


of all possible solutions to this system of equations, i.e., the set of vectors that satisfy all parity checks, is then a code of length n and dimension at least k (and possibly more, if the rows of H are not linearly independent). Thus a code C may be specified by providing a parity-check matrix for C. Note that, whereas a generator matrix for C gives us a convenient encoder, a parity-check matrix H for C gives us a convenient means of testing a vector for membership in the code, since a given vector r ∈ F2^n is a codeword if and only if r H^T = 0. More generally, parity-check matrices are useful for decoding since, if r is a non-codeword, it is the structure of parity-check failures in the so-called syndrome r H^T that provides evidence about which bits of r need to be changed in order to recover a valid codeword. Low-density parity-check (LDPC) codes have the special property that they are defined via a parity-check matrix that is sparse, i.e., by an H matrix that has only a small number of nonzero entries. If the H matrix has a fixed number of ones in each row and a fixed number of ones in each column, then the corresponding code is called a regular LDPC code; otherwise, the code is an irregular code. As an example, one of the earliest families of LDPC codes—the so-called (3,6)-regular LDPC codes, defined by R. G. Gallager at MIT in the early 1960s (Gallager, 1963)—have an H matrix with exactly 6 ones in each row and 3 ones in each column. If we take n reasonably large, say n = 2000, the matrix contains just 3n = 6,000 ones, whereas a binary matrix of the same size generated by flipping a fair coin for each matrix entry would on average contain one million ones. Thus we see that the H matrix is very sparse indeed. Why is sparseness important? The answer lies in the nature of the decoding algorithm, which we describe next.
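The sparsity comparison above is easy to check numerically. The shape below assumes a rate-1/2 code, so H is (n/2) × n; the random baseline is the fair-coin construction mentioned in the text.

```python
import numpy as np

n = 2000
m = n // 2                     # number of parity checks for a rate-1/2 code
rng = np.random.default_rng(0)

# A (3,6)-regular H has exactly 6 ones per row (equivalently 3 per column):
regular_ones = 6 * m           # = 3 * n = 6,000 ones

# A fair-coin matrix of the same m x n size has about one million ones:
coin_ones = int(rng.integers(0, 2, size=(m, n)).sum())
```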

10.3

Message-Passing Decoding of LDPC Codes

10.3.1

From Codes to Graphs

The relationship between variables and equations can be visualized using a graph, such as the one shown in Figure 10.1. This graph (called a Forney-style factor graph (Forney, 2001; Loeliger, 2004)) consists of various vertices and edges (as is conventional in graph theory), and (somewhat unconventionally) also includes a number of “half-edges,” which are edges incident on a single vertex only. The half-edges are denoted as ‘⊥’ in Figure 10.1. Edges and half-edges represent binary variables. A “conﬁguration” is an assignment of a binary value to each edge and half-edge. Certain conﬁgurations will be regarded as “valid conﬁgurations,” and all others as invalid. Vertices in the graph represent “local constraints” that the variables must satisfy in order to form a valid conﬁguration. So-called “equality constraints” (or “equality vertices”), denoted with ‘=’ in Figure 10.1, constrain all neighboring variables (i.e., all incident edges) to take on the same value in every valid conﬁguration, whereas so-called


Figure 10.1 A factor graph for a (very small) (3,6)-regular LDPC code. Each edge (and half-edge) represents a binary variable. The boxes labeled = are equality constraints that enforce the rule that all incident edges are to have the same value in every valid configuration. The boxes labeled + are parity-check constraints that enforce the rule that all incident edges are to have even parity (zero-sum, modulo two).

"parity-check constraints" (or "check vertices"), denoted with '+' in Figure 10.1, constrain the neighboring variables to form a configuration having an even number of ones, i.e., a modulo-two sum of zero. The valid configurations are precisely those that satisfy all local constraints. The half-edges represent the codeword symbols v1, v2, . . . , v18. Half-edges can be viewed as the "interface" (or "read-out") between the configuration space induced by the internal structure of the graph and the desired "external" behavior. Equivalently, the full edges in the graph may be regarded as hidden (or "auxiliary" or "state") variables, and the half-edges as observed or primary variables. The equality constraints essentially serve to "copy" the value of each codeword symbol in a valid configuration to the neighboring (full) edges. Each of the parity-check constraints implements a single parity-check equation; for example, the highlighted parity-check constraint in Figure 10.1 essentially implements the equation v3 + v6 + v9 + v12 + v15 + v18 = 0; however, instead of involving the variables v3, v6, etc., directly, the highlighted parity-check constraint vertex involves copies of these variables. It should now be clear that the set of valid configurations projected on the half-edges in Figure 10.1 forms a binary linear code satisfying the 9 different parity-check equations implemented by the check vertices. Indeed, as the reader may verify by


tracing edges, this code is defined by the parity-check matrix

        ⎡ 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ⎤
        ⎢ 1 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 0 0 ⎥
        ⎢ 0 1 0 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 ⎥
        ⎢ 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 ⎥
    H = ⎢ 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 ⎥.
        ⎢ 1 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 ⎥
        ⎢ 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 ⎥
        ⎢ 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 1 ⎥
        ⎣ 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 ⎦
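As a sanity check, the parity-check matrix of this example can be transcribed and tested: each row should contain exactly 6 ones and each column exactly 3 ones, the fourth row should be the highlighted check v3 + v6 + v9 + v12 + v15 + v18 = 0, and the membership test r H^T = 0 should accept the all-zero codeword.

```python
import numpy as np

# The 9 x 18 parity-check matrix of the toy (3,6)-regular code of fig. 10.1.
H = np.array([
    [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0],
    [1,0,0,1,1,0,0,0,1,0,0,0,0,1,0,1,0,0],
    [0,1,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,0],
    [0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1],
    [0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0],
    [1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0],
    [0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0],
    [0,0,0,0,1,0,1,0,0,0,0,0,1,1,0,1,0,1],
    [0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1],
])

# (3,6)-regularity: 6 ones per row, 3 per column.
assert (H.sum(axis=1) == 6).all() and (H.sum(axis=0) == 3).all()

# Row 4 is the highlighted check on v3, v6, v9, v12, v15, v18.
assert list(np.flatnonzero(H[3]) + 1) == [3, 6, 9, 12, 15, 18]

def is_codeword(r):
    """Membership test: r is a codeword iff r H^T = 0 (mod 2)."""
    return not ((np.asarray(r) @ H.T) % 2).any()

is_codeword(np.zeros(18, dtype=int))   # True: all-zero satisfies every check
```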

It is clear that H encodes the incidence structure of the graph, with an edge corresponding to each nonzero entry of H. If a nonzero entry occurs in row i and column j, then the edge connects the ith check vertex with the jth equality vertex. Because of the correspondence between H and the factor graph, sparseness of H implies sparseness of the graph and vice versa.
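This incidence structure is easy to sanity-check programmatically. The sketch below (ours, for illustration) transcribes the matrix H given above and verifies its (3,6)-regularity and the highlighted check equation:

```python
# Parity-check matrix H of the (3,6)-regular toy code, transcribed row by row
# from the matrix in the text (9 checks, 18 codeword symbols).
H = [
    [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0],
    [1,0,0,1,1,0,0,0,1,0,0,0,0,1,0,1,0,0],
    [0,1,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,0],
    [0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1],
    [0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0],
    [1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0],
    [0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0],
    [0,0,0,0,1,0,1,0,0,0,0,0,1,1,0,1,0,1],
    [0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1],
]

# (3,6)-regularity: every row (check vertex) has degree 6 and every
# column (equality vertex) has degree 3.
assert all(sum(row) == 6 for row in H)
assert all(sum(H[i][j] for i in range(9)) == 3 for j in range(18))

# The highlighted check in Figure 10.1 implements v3+v6+v9+v12+v15+v18 = 0:
# row 4 of H is supported exactly on those columns (1-based indexing).
support = [j + 1 for j, h in enumerate(H[3]) if h == 1]
assert support == [3, 6, 9, 12, 15, 18]

# Membership test: v is a codeword iff H v^T = 0 (mod 2).
def is_codeword(v):
    return all(sum(h * x for h, x in zip(row, v)) % 2 == 0 for row in H)

assert is_codeword([0] * 18)   # the all-zero word is always a codeword
assert is_codeword([1] * 18)   # all-ones works here: every row has even weight
```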

10.3.2 Channel Models

As noted above, communication channels are typically modeled in discrete time: a channel accepts a channel input xi and produces a corresponding channel output yi, according to some probabilistic model. For example, the binary symmetric channel described earlier accepts, at time i, a binary digit xi ∈ {0, 1} at its input, and produces a binary digit yi ∈ {0, 1} at its output, with the property that yi = xi with probability 1 − p (and therefore yi ≠ xi with probability p). The parameter p is called the crossover probability of the channel. The binary symmetric channel is assumed to be memoryless, which means that, given the ith channel input xi, the channel output yi is independent of all other channel inputs and outputs, i.e. (assuming n channel inputs and outputs in total),

$$ p(y_i \mid x_1, \ldots, x_n, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = p(y_i \mid x_i). $$

The capacity C(p) of the binary symmetric channel with crossover probability p is given by (Shannon, 1948)

$$ C(p) = 1 - H(p) \ \text{bits/symbol}, $$

where H(p) denotes the binary entropy function H(p) = −p log₂ p − (1 − p) log₂(1 − p). The capacity is plotted in Figure 10.2(a). We will also consider the binary-input additive white Gaussian noise (AWGN) channel, which at time i accepts a channel input xi ∈ {−1, 1} and produces the (real-valued) output yi = xi + ni, where ni is a zero-mean Gaussian random variable with variance σ². This channel is also assumed to be memoryless, and serves as a model of the widely implemented continuous-time transmission schemes known as


binary phase-shift keying (BPSK) and quadrature phase-shift keying (QPSK). The capacity C(σ) of the binary-input additive white Gaussian noise channel with noise variance σ² is given by

$$ C(\sigma) = 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-u^2/2}\, \log_2\!\left(1 + e^{-2(\sigma u + 1)/\sigma^2}\right) du \ \text{bits/symbol}. $$

This function is plotted in Figure 10.2(b). Also plotted is the function ½ log₂(1 + 1/σ²), which represents the capacity of an additive white Gaussian noise channel in which the channel input is constrained to have unit second moment, but is otherwise unconstrained in value. Instead of using the noise variance σ² as a parameter of an AWGN channel, one often encounters the so-called “bit-energy to noise-density ratio,” denoted Eb/N0. This terminology arises in the context of continuous-time AWGN channels, in which the one-sided noise power spectral density is typically parameterized by the value N0. For a code of rate R and a noise variance of σ², we have

$$ \frac{E_b}{N_0} = \frac{1}{2R\sigma^2}. $$

Often the value of Eb/N0 is quoted in decibels (dB), i.e., the value quoted is 10 log₁₀(Eb/N0). Thus, for example, a code of rate R operating in an AWGN channel with an Eb/N0 of x dB is operating in a channel of noise variance

$$ \sigma^2 = \frac{10^{-x/10}}{2R}. $$

Figure 10.2 Capacity as a function of channel parameter: (a) the binary symmetric channel with crossover probability p; (b) the binary-input additive white Gaussian noise channel with noise variance σ².

In the coding literature one also often encounters the term “Shannon limit.” The Shannon limit for a code of rate R is the value of the channel parameter corresponding to the worst channel that (in principle) could be used with a code of that rate, i.e., the channel parameter for which the capacity of the channel equals R. In the case of the AWGN channel, the performance of a coding scheme is often quoted in terms of a distance (in dB) from the Shannon limit. Thus, if the code achieves acceptable performance at a noise variance σ², but the corresponding Shannon limit is σ₀² > σ², then the distance to the Shannon limit is given as 10 log₁₀(σ₀²/σ²) dB.
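The capacity formulas and the Eb/N0 conversion above are straightforward to evaluate numerically. The sketch below is ours, for illustration; the trapezoidal grid size and truncation limit in the integral are arbitrary choices:

```python
import math

def bsc_capacity(p):
    """C(p) = 1 - H(p) for the binary symmetric channel."""
    if p in (0.0, 1.0):
        return 1.0
    return 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def biawgn_capacity(sigma, n=2000, lim=10.0):
    """Binary-input AWGN capacity, by trapezoidal integration of the
    integral given in the text over u in [-lim, lim]."""
    du = 2 * lim / n
    total = 0.0
    for k in range(n + 1):
        u = -lim + k * du
        w = 0.5 if k in (0, n) else 1.0   # trapezoid end-point weights
        total += w * du * math.exp(-u * u / 2) * math.log2(
            1 + math.exp(-2 * (sigma * u + 1) / sigma ** 2))
    return 1 - total / math.sqrt(2 * math.pi)

def ebn0_db_from_sigma(sigma, rate):
    """Eb/N0 in dB for a rate-R code on an AWGN channel: Eb/N0 = 1/(2 R sigma^2)."""
    return 10 * math.log10(1 / (2 * rate * sigma ** 2))
```

For example, ebn0_db_from_sigma(0.8809, 0.5) ≈ 1.10 dB, i.e., a rate-1/2 code at noise standard deviation σ ≈ 0.881 is operating at an Eb/N0 of about 1.10 dB.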

10.3.3 From Graphs to Decoding Algorithms

Most decoding algorithms for low-density parity-check codes operate by passing “messages” along the edges of the graph describing the code. We will begin by providing an intuitive description of this process. Initially, messages are derived from the channel outputs. The channel outputs are translated into a “belief” about the value of the corresponding codeword symbol, where a “belief” is a guess about the value (zero or one) along with a measure of confidence in that guess. Unfortunately, due to channel noise, some of these beliefs are, in fact, erroneous. Beliefs about each codeword symbol are communicated to the check vertices. By enforcing the rule that in a valid configuration the modulo-two sum of the bit values is zero, the checks can update the beliefs. For example, if the beliefs received at a check form a configuration that does not satisfy the zero-sum rule, then the bit with the least confidence could be informed that it should probably alter its belief. The process of sending messages from equality vertices to check vertices and back again is called an “iteration,” and after several iterations (depending on the code and the noise in the channel), the beliefs about the symbols, with high probability, reflect the transmitted configuration. Now it becomes clear why sparseness in the graph is important. First, the total amount of computation required per iteration is proportional to the number of edges in the decoding graph, and this number is exactly equal to the number of nonzero entries in the H matrix. Thus, the sparser the graph, the smaller the decoding complexity. Second, sparseness helps to make it difficult for short cycles in the graph (which can sometimes cause reinforcement of erroneous beliefs)


to influence the decoder unduly. We also notice that this decoding algorithm naturally supports parallelism, as the message transfers between checks and variables can, in principle, all occur simultaneously. This observation is the foundation for a number of hardware implementations of LDPC decoders, in which processing nodes correspond directly to factor-graph vertices, and wires connecting these nodes correspond directly to factor-graph edges. We will now give a more precise description of message-passing decoding, starting with the so-called sum-product algorithm; see (Kschischang et al., 2001) for more details. Messages passed on an edge during decoding are probability mass functions for the corresponding binary variable. A probability mass function p(x) for a binary variable X can be encoded with just a single parameter (e.g., p(0), p(1), p(0) − p(1), p(0)/p(1), etc.), and hence messages are real-valued scalars. A very commonly used parametrization is the log-likelihood ratio (LLR), defined as ln(p(0)/p(1)). Note that the sign of an LLR value indicates which symbol value (0 or 1) is more likely, and so can be used to make a decision on the value of that symbol. The magnitude of the LLR can be interpreted as a measure of confidence in the decision; a large magnitude indicates a large disparity between p(0) and p(1), and hence a greater confidence in the truth of the decision. The “neutral message” corresponding to the uniform distribution will be denoted as μ0. If an LLR representation is used, then μ0 = 0. Messages are always directed (along an edge or half-edge) in the graph. A message on a half-edge directed to a vertex v will be denoted as μ→v, and a message on a full edge directed from a vertex v1 to a vertex v2 will be denoted as μv1→v2. We will denote the set of neighbors of a vertex v1 as N(v1). If v2 ∈ N(v1), then N(v1) \ {v2} is the set of neighbors of v1 excluding v2.
Initialization: The decoding algorithm is initialized by sending neutral messages on all full edges, i.e., μv1→v2 = μv2→v1 = μ0 for every pair of vertices v1, v2 connected by an edge. The half-edges are initialized with so-called “intrinsic” or “channel” messages, corresponding to the received channel output. In particular, for a binary symmetric channel with crossover probability p, the LLR associated with channel output y ∈ {0, 1} is given by

$$ \lambda(y) = (-1)^y \ln\frac{1-p}{p}, $$

and this is the initial message sent toward the corresponding equality vertex in the graph. Similarly, for the binary-input AWGN channel with noise variance σ², the LLR associated with channel output y ∈ ℝ is given by

$$ \lambda(y) = \frac{2y}{\sigma^2}, $$


assuming that transmission of a zero corresponds to the +1 channel input, and transmission of a one corresponds to the −1 channel input.

Local Updates: Messages are updated at the vertices according to the principle that the message μv1→v2 sent from vertex v1 to its neighbor v2 is a function of the messages directed toward v1 on all edges other than the edge {v1, v2}. This principle is one of the pillars that leads to an analysis of the decoder; furthermore, it leads to optimum decoding in a cycle-free graph (see Kschischang et al., 2001). Assuming that messages are represented as LLR values, the message sent by the sum-product algorithm from an equality vertex v1 to a neighboring check vertex v2 is given as

$$ \mu_{v_1 \to v_2} = \mu_{\to v_1} + \sum_{v' \in N(v_1) \setminus \{v_2\}} \mu_{v' \to v_1}, \qquad (10.2) $$

where μ→v1 denotes the channel message received along the half-edge connected to v1. Similarly, the message sent from a check vertex v2 to a neighboring equality vertex v1 is given as

$$ \mu_{v_2 \to v_1} = 2 \tanh^{-1}\!\left( \prod_{v' \in N(v_2) \setminus \{v_1\}} \tanh\!\left(\mu_{v' \to v_2}/2\right) \right). \qquad (10.3) $$

The hyperbolic tangent functions involved in this update rule are actually performing a change of message representation: if λ is the LLR value ln(p(0)/p(1)), then tanh(λ/2) = p(0) − p(1), the probability difference. The product of tanh(·) functions in (10.3) actually implements a product of probability differences, which can itself be seen as the difference between the probability that the local configuration has even parity and the probability that it has odd parity. Messages received on full edges at an equality vertex are referred to as “extrinsic” messages. Extrinsic messages, unlike the “intrinsic” channel message, reflect the structure of the code and change from iteration to iteration; when decoding is successful, the quality (magnitude, in an LLR implementation) of the extrinsic messages improves from iteration to iteration. In this way, the extrinsic messages can “overwhelm” erroneous channel messages, leading to successful decoding. Update Schedule: The order in which messages are updated is referred to as the “update schedule.” A commonly used schedule is to send messages from each equality vertex toward the neighboring check vertices (in any order), and then to send messages in the opposite direction (in any order). One complete such update is referred to as an “iteration,” and usually many iterations are performed before a decoding decision is reached. Other update schedules can lead to faster convergence (Sharon et al., 2004; Xiao and Banihashemi, 2004), but we will not consider these here.
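The update rules (10.2) and (10.3), the flooding schedule, and the usual stopping test (stop once the tentative decisions form a valid codeword) can be combined into a compact decoder. The sketch below is ours and is written for clarity rather than speed:

```python
import math

def sum_product_decode(H, channel_llr, max_iters=50):
    """Sum-product LDPC decoding with LLR messages (an illustrative sketch):
    check update (10.3), equality update (10.2), flooding schedule,
    early stop when the tentative decisions satisfy all checks."""
    m, n = len(H), len(H[0])
    rows = [[j for j in range(n) if H[i][j]] for i in range(m)]  # check neighbors
    cols = [[i for i in range(m) if H[i][j]] for j in range(n)]  # symbol neighbors
    # With neutral (mu_0 = 0) check messages at initialization, the first
    # equality-to-check messages are just the channel LLRs.
    eq2ch = {(i, j): channel_llr[j] for i in range(m) for j in rows[i]}
    ch2eq = {}
    decisions = [0 if l > 0 else 1 for l in channel_llr]
    for _ in range(max_iters):
        for i in range(m):                     # check-vertex update, eq. (10.3)
            for j in rows[i]:
                prod = 1.0
                for j2 in rows[i]:
                    if j2 != j:
                        prod *= math.tanh(eq2ch[(i, j2)] / 2)
                prod = max(min(prod, 1 - 1e-12), -1 + 1e-12)  # keep atanh finite
                ch2eq[(i, j)] = 2 * math.atanh(prod)
        for j in range(n):                     # equality-vertex update, eq. (10.2)
            for i in cols[j]:
                eq2ch[(i, j)] = channel_llr[j] + sum(
                    ch2eq[(i2, j)] for i2 in cols[j] if i2 != i)
        # symbol-by-symbol decisions from the total LLR at each equality vertex
        total = [channel_llr[j] + sum(ch2eq[(i, j)] for i in cols[j])
                 for j in range(n)]
        decisions = [0 if t > 0 else 1 for t in total]
        if all(sum(decisions[j] for j in rows[i]) % 2 == 0 for i in range(m)):
            break                              # valid codeword: stop early
    return decisions

# Toy example: two checks, v1+v2 = 0 and v2+v3 = 0 (codewords 000 and 111).
H_toy = [[1, 1, 0], [0, 1, 1]]
```

For instance, with channel LLRs [2.0, −0.5, 1.5] (the middle bit received in error), the decoder above recovers the all-zero codeword.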


Termination: Decoding decisions are based on all of the messages (both intrinsic and extrinsic) directed toward each equality vertex v. Decisions are made on a symbol-by-symbol basis. With LLR messages, the decision statistic is given as

$$ \mu = \mu_{\to v} + \sum_{v' \in N(v)} \mu_{v' \to v}. $$

If μ > 0, then the corresponding codeword symbol value is chosen to be 0; otherwise it is chosen to be 1. Usually iterations are performed until these decisions yield a valid codeword, or until the number of iterations reaches some allowed maximum. Note that the total computational complexity (assuming a fixed maximum number of iterations and a fixed distribution of vertex degrees) scales linearly with the block length of the code.

In addition to the sum-product algorithm, a number of other (often simpler) message-passing algorithms have been studied. These include the “min-sum” algorithm and Gallager's “decoding algorithm B,” which are described next.

Min-sum algorithm: In the min-sum algorithm, the update rule at an equality vertex is the same as in the sum-product algorithm (10.2), but the update rule at a check vertex v2 is simplified to

$$ \mu_{v_2 \to v_1} = \min_{v' \in N(v_2)\setminus\{v_1\}} \left|\mu_{v' \to v_2}\right| \cdot \prod_{v' \in N(v_2)\setminus\{v_1\}} \operatorname{sign}\!\left(\mu_{v' \to v_2}\right). \qquad (10.4) $$

Notice that the tanh⁻¹ of the product of tanh's is approximated as the minimum of the absolute values times the product of the signs. This approximation becomes more accurate as the magnitude of the messages increases.

Gallager's decoding algorithm B: In this algorithm, introduced by Gallager (1963), the message alphabet is {0, 1}. In other words, the messages communicate “decisions” only, without an associated reliability. The update rule at a check vertex v2 is

$$ \mu_{v_2 \to v_1} = \bigoplus_{v' \in N(v_2)\setminus\{v_1\}} \mu_{v' \to v_2}, \qquad (10.5) $$

where ⊕ represents the modulo-two sum of binary messages. At an equality vertex v1 of degree dv + 1, the outgoing message μv1→v2 is

$$ \mu_{v_1 \to v_2} = \begin{cases} \bar{\mu}_{\to v_1} & \text{if } \exists\, v'_1, \ldots, v'_b \in N(v_1)\setminus\{v_2\} : \mu_{v'_1 \to v_1} = \cdots = \mu_{v'_b \to v_1} = \bar{\mu}_{\to v_1}, \\ \mu_{\to v_1} & \text{otherwise,} \end{cases} \qquad (10.6) $$

where the overbar denotes the binary complement and b is an integer in the range ⌈(dv − 1)/2⌉ < b < dv. Here, the outgoing message of an equality vertex is the same as the intrinsic message, unless at least b of the extrinsic messages disagree with it. The value of b may change from one iteration to another. The optimum value of b for a (dv, dc)-regular LDPC code (i.e., a code with an H-matrix having dc ones in every row and dv ones in every column) was computed by Gallager


(1963) and is the smallest integer b for which

$$ \frac{1-p}{p} \le \left( \frac{1 + (1 - 2p_e)^{d_c - 1}}{1 - (1 - 2p_e)^{d_c - 1}} \right)^{2b - d_v + 1}, \qquad (10.7) $$

where p and pe are the channel crossover probability (intrinsic message error rate) and the extrinsic message error rate, respectively. It can be proved that Algorithm B is the best possible binary message-passing algorithm for regular LDPC codes.
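The check- and equality-vertex rules of these two simplified algorithms are compact enough to state directly in code. The sketch below is ours; the function names are invented for illustration:

```python
import math

def minsum_check_update(incoming):
    """Min-sum approximation (10.4): outgoing magnitude is the smallest
    incoming magnitude; outgoing sign is the product of incoming signs.
    `incoming` holds the LLRs from all neighbors other than the target."""
    sign = 1.0
    for msg in incoming:
        sign *= 1.0 if msg >= 0 else -1.0
    return sign * min(abs(msg) for msg in incoming)

def gallager_b_check_update(incoming_bits):
    """Rule (10.5): modulo-two sum of the other neighbors' bit messages."""
    s = 0
    for b in incoming_bits:
        s ^= b
    return s

def gallager_b_equality_update(channel_bit, extrinsic_bits, b):
    """Rule (10.6): flip the intrinsic bit iff at least b extrinsic
    messages disagree with it."""
    disagree = sum(1 for x in extrinsic_bits if x != channel_bit)
    return 1 - channel_bit if disagree >= b else channel_bit

# The min-sum magnitude upper-bounds the exact sum-product magnitude:
exact = 2 * math.atanh(math.tanh(0.5) * math.tanh(1.0) * math.tanh(1.5))
assert abs(exact) <= abs(minsum_check_update([1.0, 2.0, 3.0]))
```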

10.4 LDPC Decoder Analysis

10.4.1 Decoding Threshold

For a binary symmetric channel with parameter p < 1/2 or an AWGN channel with parameter σ, the performance of an iterative decoder degrades with an increasing channel parameter. Richardson and Urbanke (2001) studied the performance of families of low-density parity-check codes with a fixed proportion of check and equality vertices of each degree. In the limit as the block length goes to infinity (so that the neighbors, next-neighbors, next-next-neighbors, etc., of each vertex, taken to a particular depth, can be assumed to form a tree), they show that the family exhibits a threshold phenomenon: there is a “worst channel” for which (almost all) members of the family have a vanishing error probability as the block length and number of iterations go to infinity. This channel condition is called the threshold of the code family. For example, the threshold of the family of (3,6)-regular codes on the AWGN channel under sum-product decoding is 1.1015 dB, which means that if an infinitely long (3,6)-regular code were used on an AWGN channel, convergence to zero error rate is almost surely guaranteed whenever Eb/N0 is greater than 1.1015 dB. If the channel condition is worse than the threshold, a nonzero error rate is assured. In practice, when finite-length codes are used, there is a gap between the Eb/N0 required to achieve a certain (small) target error probability and the threshold associated with the given family, but this gap shrinks as the code length increases. The main aim in the asymptotic analysis of families of LDPC codes is to determine the threshold associated with the family, and one of the aims of code design is to choose the parameters of the family so that the threshold approaches channel capacity.

10.4.2 Extrinsic Information Transfer (EXIT) Charts

An iterative decoder can be thought of as a “black box” that at each iteration takes two sources of knowledge about the transmitted codeword—the intrinsic information and the extrinsic information—and attempts to obtain an “improved” knowledge about the transmitted codeword. The “improved” knowledge is then used


as the extrinsic information for the next iteration. When decoding is successful, the extrinsic information gets better and better as the decoder iterates. Therefore, in all methods of analysis of iterative decoders, statistics of the extrinsic messages at each iteration are studied. For example, one might study the evolution of the entire probability density function (pdf) of the extrinsic messages from iteration to iteration. This is the most complete (and probably most complex) analysis, and is known as density evolution (Richardson and Urbanke, 2001). However, as an approximate analysis, one may study the evolution of a representative or an approximate parametrization of the true density. An example of this approach is the use of so-called “extrinsic information transfer” (EXIT) charts (Divsalar et al., 2000; El Gamal and Hammons, 2001; ten Brink, 2000, 2001). In EXIT-chart analysis, instead of tracking the density of messages, one tracks the evolution of a single parameter—a measure of the decoder's success—iteration by iteration. For example, one might track the “signal-to-noise ratio” of the extrinsic messages (Divsalar et al., 2000; El Gamal and Hammons, 2001), their error probability (Ardakani and Kschischang, 2004), or the mutual information between messages and decoded bits (ten Brink, 2000). Initially, the term “EXIT chart” was used when tracking mutual information; however, the use of this term was generalized in (Ardakani and Kschischang, 2004) to the tracking of other parameters as well. Let s denote the message parameter being tracked, and let s0 denote the parameter associated with the channel messages. If si denotes the parameter associated with the extrinsic messages at the input of the ith iteration, then an EXIT chart is the function f(si, s0) that gives the value of the message parameter si+1 at the output of the ith iteration, i.e., we have si+1 = f(si, s0).
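Once a chart f is available, the recursion si+1 = f(si, s0) is trivial to simulate. The sketch below iterates a chart and counts the iterations needed to reach a target error rate; the two demonstration charts are invented stand-ins, not measurements from any real code:

```python
def exit_trajectory(f, p0, target=1e-4, max_iters=200):
    """Iterate p_{i+1} = f(p_i, p0), starting from the channel error rate p0.
    Returns the number of iterations needed to reach `target`, or None if
    the chart touches/crosses the 45-degree line (closed tunnel)."""
    p = p0
    for it in range(1, max_iters + 1):
        p_next = f(p, p0)
        if p_next >= p:      # no progress: tunnel is closed here
            return None
        p = p_next
        if p <= target:
            return it
    return None

# Invented stand-in charts, for illustration only:
f_open = lambda p, p0: 0.8 * p * p / p0 + 0.1 * p   # open tunnel for p <= p0
f_closed = lambda p, p0: p                           # chart on the 45-degree line

assert exit_trajectory(f_open, 0.1) is not None
assert exit_trajectory(f_closed, 0.1) is None
```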
In the remainder of this chapter, we use EXIT charts based on tracking the message error rate, as we find them most useful for our applications. Thus we track the proportion of messages that give the “wrong” value for the corresponding symbol. If pin denotes this proportion at the input of an iteration, and pout denotes this proportion at the output of an iteration, then we have pout = f(pin, p0), where p0 denotes the proportion of “wrong” channel messages. For a fixed p0, this function can be plotted in pin–pout coordinates. Usually EXIT charts are presented by plotting both f and its inverse f⁻¹, as this makes the visualization of the decoder easier. Figure 10.3 shows the concept. As can be seen from the figure, decoder progress can be visualized as a series of steps, shuttling between f and f⁻¹ (as pout of one iteration becomes pin of the next). Using EXIT charts, one can thus study how many iterations are required to achieve a target message error rate.

Figure 10.3 An EXIT chart based on message error rate for a regular (3,6) code (pin on the horizontal axis, pout on the vertical; the predicted trajectory and actual results are shown for iterations 1 through 4). Simulation results are for a randomly generated (3,6)-regular code of block length 200,000 on an AWGN channel with Eb/N0 = 1.75 dB.

The region of the graph between f and f⁻¹ is referred to as the “decoding tunnel.” If f and f⁻¹ cross, so that for some pin we have pout > pin, successful decoding to a small error probability does not occur; in such cases we say that the EXIT chart (and the tunnel) is closed (otherwise, it is open). An open EXIT chart always lies below the 45-degree line pout = pin. As p0 gets worse (i.e., as the channel degrades), the decoding tunnel becomes tighter and tighter, and hence the decoder requires more and more iterations to converge to a target error probability. Eventually, when p0 is bad enough, the tunnel closes completely. This condition gives an estimate of the code threshold, i.e., we may estimate the threshold p0* as the worst channel condition for which the tunnel is open, by defining

$$ p_0^* = \sup\,\{\, p_0 : f(p_{\text{in}}, p_0) < p_{\text{in}} \ \text{for all } 0 < p_{\text{in}} \le p_0 \,\}. $$
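When f is monotone in p0, this supremum can be located numerically by bisection on the tunnel-openness condition. The sketch below does so for a toy chart family (f_toy is invented; by construction its tunnel is open exactly when 4·p0 < 1, so the threshold is 0.25):

```python
def is_open(f, p0, grid=1000):
    """Check the tunnel condition f(p, p0) < p on a grid over (0, p0]."""
    return all(f(p0 * k / grid, p0) < p0 * k / grid for k in range(1, grid + 1))

def threshold(f, lo=0.0, hi=0.5, iters=40):
    """Bisection search for the largest p0 with an open EXIT chart,
    assuming openness is monotone in p0."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if is_open(f, mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy chart family, for illustration only: f(p, p0) = 4 p0 p (1 - p).
f_toy = lambda p, p0: 4 * p0 * p * (1 - p)
```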

EXIT-chart analysis is not as accurate as density evolution, because it tracks just a single parameter as the representative of a pdf. For many applications, however, EXIT charts are very accurate. For instance, in (ten Brink, 2000, 2001), EXIT charts are used to approximate the behavior of iterative turbo decoders on a Gaussian channel very accurately. In (Ardakani and Kschischang, 2004) it is shown that, using EXIT charts, the threshold of convergence for LDPC codes on the AWGN channel can be approximated to within a few thousandths of a dB of the actual value.


In the next section we show that EXIT charts can be used to design irregular LDPC codes that perform no more than a few hundredths of a dB worse than those designed by density evolution. One should also notice that when the pdf of the messages can truly be described by a single parameter, e.g., on the so-called “binary erasure channel,” EXIT-chart analysis is equivalent to density evolution.

10.4.3 Gaussian Approximations

There have been a number of approaches to one-dimensional analysis of sum-product decoding of LDPC codes on the AWGN channel (Ardakani and Kschischang, 2004; Chung et al., 2001; Divsalar et al., 2000; Lehmann and Maggio, 2002; ten Brink and Kramer, 2003; ten Brink et al., 2004), all of them based on the observation that the pdf of the decoder's LLR messages is approximately Gaussian. This approximation is quite accurate for messages sent from equality vertices, but less so for messages sent from check vertices. In this subsection we describe an accurate one-dimensional analysis of LDPC codes based on a Gaussian assumption only for the messages sent from the equality vertices. Because AWGN channels and binary symmetric channels treat 0's and 1's symmetrically, and because LDPC codes are binary linear codes, the behavior of a sum-product decoder is independent of which codeword was transmitted (Richardson and Urbanke, 2001). In the analysis of a decoder, we may therefore assume that the all-zero codeword (equivalent to the all-{+1} channel word) is transmitted. A probability density function f(x) is called symmetric if f(x) = eˣ f(−x). In (Richardson and Urbanke, 2001) it is shown that if the LLR density of the channel messages is symmetric, then all messages sent in sum-product decoding are symmetric. A Gaussian pdf with mean m and variance σ² is symmetric if and only if σ² = 2m; as a result, a symmetric Gaussian density can be expressed by a single parameter. Under the assumption that the all-zero codeword was transmitted over an AWGN channel, the intrinsic LLR messages have a symmetric Gaussian density with a mean of 2/σ² and a variance of 4/σ², where σ² is the variance of the Gaussian channel noise. It follows that under sum-product decoding, all messages remain symmetric.
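The symmetry condition and the σ² = 2m characterization are easy to verify numerically (an illustrative check, ours):

```python
import math

def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# A Gaussian N(m, 2m) satisfies the symmetry condition f(x) = e^x f(-x):
m = 1.7
for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
    assert abs(gauss_pdf(x, m, 2 * m)
               - math.exp(x) * gauss_pdf(-x, m, 2 * m)) < 1e-12

# The intrinsic LLR density N(2/sigma^2, 4/sigma^2) is symmetric, since
# its variance is twice its mean:
sigma = 0.8
assert abs((4 / sigma ** 2) - 2 * (2 / sigma ** 2)) < 1e-12
```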
In addition, since the update rule at the equality vertices is the summation of incoming messages, according to the central limit theorem, the density of the messages at the output of equality vertices tends to be Gaussian, so it seems sensible to approximate them with a symmetric Gaussian. To avoid a Gaussian assumption on the output of check vertices, we consider one whole iteration at once. That is to say, we study the input-output behavior of the decoder from the input of the iteration (messages from equality vertices to check vertices) to the output of that iteration (messages from equality vertices to check vertices). Figure 10.4 illustrates the idea. In every iteration we assume that the input and the output messages shown in Figure 10.4, which are outputs of equality vertices, are symmetric Gaussian. We start with Gaussian distributed messages at


Figure 10.4 A depth-one tree for a (3,6)-regular LDPC code, showing the channel message, the messages from the previous iteration, and the input to the next iteration.

the input of the iteration and compute the pdf of the messages at the output. This can be done by “one-step” density evolution. Then we approximate the actual output pdf with a symmetric Gaussian. Since we assume that the all-zero codeword is transmitted, the negative tail of this density reflects the message error rate. As a result, we can track the evolution of the message error rate and represent it in an EXIT chart. This technique led to the results shown in Figure 10.3, which show the close agreement between simulation results and the decoding behavior predicted by the EXIT-chart analysis. We refer to the method of approximating only the output of the equality vertices with a Gaussian (but not the output of the check vertices) as the “semi-Gaussian” approximation.
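One point of such a semi-Gaussian EXIT chart can be estimated by Monte Carlo: draw the iteration-input messages from a symmetric Gaussian N(m, 2m), push them through the exact check update and the equality update, and read off the fraction of negative output LLRs. The sketch below is ours; the sample count and parameters are arbitrary choices:

```python
import math
import random

def semi_gaussian_exit_point(m_in, sigma_ch, dv=3, dc=6, samples=10000, seed=1):
    """Monte Carlo estimate of one semi-Gaussian EXIT-chart point for a
    (dv, dc)-regular code: iteration-input messages (equality-vertex outputs)
    are drawn from N(m_in, 2*m_in); the check update uses the exact tanh
    rule (10.3); the result is the message error rate at the equality-vertex
    outputs (fraction of negative LLRs, all-zero codeword assumed)."""
    rng = random.Random(seed)
    std_in = math.sqrt(2 * m_in)

    def check_output():
        prod = 1.0
        for _ in range(dc - 1):
            prod *= math.tanh(rng.gauss(m_in, std_in) / 2)
        prod = max(min(prod, 1 - 1e-15), -1 + 1e-15)
        return 2 * math.atanh(prod)

    # Channel LLRs are N(2/sigma^2, 4/sigma^2) under the all-zero codeword.
    m_ch, std_ch = 2 / sigma_ch ** 2, 2 / sigma_ch
    errors = 0
    for _ in range(samples):
        llr = rng.gauss(m_ch, std_ch)
        llr += sum(check_output() for _ in range(dv - 1))
        if llr < 0:
            errors += 1
    return errors / samples
```

Sweeping m_in (equivalently, the input error rate pin) traces out the chart; in the actual semi-Gaussian method a symmetric Gaussian is then refitted to the output density, whereas this sketch only reads off the error rate.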

10.4.4 Analysis of Irregular LDPC Codes

A single depth-one tree cannot be defined for irregular codes, since not all vertices (even of the same type) have the same degree. If the check degree distribution is fixed, each equality vertex of a fixed degree gives rise to its own depth-one tree, and irregularity in the check vertices is taken into account within these depth-one trees. For any fixed check degree distribution, we refer to the depth-one tree associated with a degree-i equality vertex as the “degree-i depth-one tree.” For reasons similar to the case of regular codes, we assume that at the output of any depth-one tree, the pdf of the LLR messages is well approximated by a symmetric Gaussian. As a result, the pdf of the LLR messages at the input of the check vertices can be approximated as a mixture of symmetric Gaussian densities, whose weights are determined by the proportion of equality vertices of each degree. Nevertheless, at the output of a given equality vertex, the distribution is still close to Gaussian, and so the semi-Gaussian method can be used to find the EXIT charts corresponding to the equality vertices of different degrees. In other words, for any i, using the degree-i depth-one tree, an EXIT chart associated with equality vertices of degree i can be found. We call such EXIT charts elementary EXIT charts. We


have pout,i = fi(pin, p0), where pout,i is the message error rate at the output of degree-i equality vertices. Now, by the law of total probability, pout for the mixture of all equality vertices can be computed as

$$ p_{\text{out}} = \sum_{i \ge 2} \Pr(\text{degree} = i)\, \Pr(\text{error} \mid \text{degree} = i) = \sum_{i \ge 2} \lambda_i f_i(p_{\text{in}}, p_0), \qquad (10.8) $$

where λi denotes the proportion of edges incident on equality vertices of degree i. Thus we obtain the important result that the overall EXIT chart is a weighted linear combination of the elementary EXIT charts. A similar formulation can be used when the mean of the messages is the parameter tracked by the EXIT chart (Chung et al., 2001). It has been shown in (Tuechler and Hagenauer, 2002) that when the messages have a symmetric pdf, mutual information also combines linearly to form the overall mutual information.
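Equation (10.8) is simply a convex combination of the elementary charts, as the following sketch illustrates (make_fi is an invented toy model, not a measured chart):

```python
def overall_exit(elementary, lam, p_in, p0):
    """Eq. (10.8): the overall EXIT chart is the lambda-weighted linear
    combination of the elementary charts.  `elementary[i]` is f_i and
    `lam[i]` is the fraction of edges incident on degree-i equality vertices."""
    assert abs(sum(lam.values()) - 1.0) < 1e-9   # a valid edge distribution
    return sum(lam[i] * elementary[i](p_in, p0) for i in lam)

# Hypothetical elementary charts (higher degree -> stronger correction):
make_fi = lambda i: (lambda p, p0: p0 * (2 * p) ** (i - 1))
elem = {2: make_fi(2), 3: make_fi(3)}
lam = {2: 0.4, 3: 0.6}
```

For example, with these toy charts, overall_exit(elem, lam, 0.1, 0.05) combines f2(0.1, 0.05) = 0.01 and f3(0.1, 0.05) = 0.002 into 0.4·0.01 + 0.6·0.002 = 0.0052.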

10.5 Design of Irregular LDPC Codes

We now describe how this EXIT-chart framework may be used to design irregular LDPC codes. In this framework, the design problem can be simplified to a linear program. Let λi denote the proportion of factor-graph edges incident on an equality vertex of degree i, and let ρj denote the proportion of edges incident on a check vertex of degree j. We formulate the design problem for an irregular LDPC code as that of shaping an EXIT chart from a group of elementary EXIT charts (according to (10.8)) so that the rate of the code is maximized, subject to the constraint that the resulting EXIT chart remains open, i.e., that f(pin, p0) < pin for all pin ∈ (0, p0], where p0 is the initial message error rate at the decoder. It can be shown that the rate of an LDPC code satisfies

$$ R \ge 1 - \frac{\sum_j \rho_j / j}{\sum_i \lambda_i / i}, $$

and hence, for a fixed check degree distribution, the design problem can be formulated as the following linear program:

maximize: Σ_{i≥2} λi/i
subject to: λi ≥ 0, Σ_{i≥2} λi = 1, and Σ_{i≥2} λi fi(pin, p0) < pin for all pin ∈ (0, p0].

In the above formulation, we have assumed that the elementary EXIT charts are given. In practice, to find these curves we need to know the degree distribution of the code: the degree distribution is needed to associate every input pin with its equivalent input pdf, which is in general assumed to be a Gaussian mixture.


Table 10.1 A list of irregular codes designed for the AWGN channel by the semi-Gaussian method. Each entry gives the equality-vertex degree sequence as pairs (dv, λdv), followed by the check degree dc, the threshold σ, the rate, and the gap to the Shannon limit (dB).

Code 1: (2, .1786) (3, .3046) (5, .0414) (6, .0531) (7, .0007) (10, .4216); dc = 40; threshold σ = 0.5072; rate = 0.9001; gap = 0.1308 dB
Code 2: (2, .1530) (3, .2438) (7, .1063) (10, .2262) (14, .0305) (19, .0001) (23, .1293) (32, .0736) (38, .0372); dc = 24; threshold σ = 0.6208; rate = 0.7984; gap = 0.0630 dB
Code 3: (2, .1439) (3, .1602) (5, .1277) (6, .0219) (7, .0279) (8, .0103) (12, .1551) (30, .0004) (37, .3525) (40, .0001); dc = 22; threshold σ = 0.6719; rate = 0.7506; gap = 0.0666 dB
Code 4: (2, .1890) (3, .1158) (4, .1153) (6, .0519) (7, .0875) (14, .0823) (15, .0007) (16, .0001) (39, .3573) (40, .0001); dc = 10; threshold σ = 0.9700; rate = 0.4954; gap = 0.1331 dB
Code 5: (2, .2444) (3, .1687) (4, .0130) (5, .1088) (7, .1120) (14, .1130) (15, .0577) (25, .0063) (25, .0109) (40, .1652); dc = 7; threshold σ = 1.1422; rate = 0.3949; gap = 0.1255 dB
Code 6: (2, .3000) (3, .1937) (4, .0192) (7, .2378) (14, .0158) (15, .0114) (20, .0910) (25, .0002) (30, .0232) (40, .1077); dc = 5; threshold σ = 1.5476; rate = 0.2403; gap = 0.2160 dB

In other words, prior to the design, the degree distribution is not known, and as a result we cannot find the elementary EXIT charts needed to solve the linear program above. To solve this problem, we suggest a recursive solution. At first we assume that the input message to the iteration has a single symmetric Gaussian density instead of a Gaussian mixture. Using this assumption, we can map every message error rate at the input of the iteration to a unique input pdf and so find the fi curves for different i. (It is interesting that even with this assumption the error in approximating the threshold of convergence, based on our observations, is less than 0.3 dB, and the codes designed this way have a convergence threshold at most 0.4 dB worse than those designed by density evolution. One reason for this is that when the input of a check vertex is a mixture of symmetric Gaussians, due to the computation at the check vertex, its output is dominated by the Gaussian in the mixture having the smallest mean.) After finding an appropriate degree distribution based on the single-Gaussian assumption, we use this degree distribution to find corrected elementary EXIT charts based on a Gaussian mixture. We then use the corrected curves to design an irregular code. At this level of design, the designed degree distribution is close to the degree distribution used in finding the elementary EXIT charts; therefore, analyzing this code with its actual degree distribution shows only minor error. One can continue these recursions for higher accuracy; however, in our examples, after one iteration of design the designed threshold and the exact threshold differed by less than 0.01 dB.
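Because both the objective and the constraints are linear in the λi, the design problem really is a linear program, which in practice would be handed to an LP solver. The toy sketch below instead brute-forces a coarse simplex grid; everything here — the elementary charts, the degrees, and the grid — is invented for illustration, with the charts chosen so that the open-tunnel constraint actually binds:

```python
import itertools

def toy_chart(i):
    """Invented elementary chart for degree i, linear in p so that the
    open-tunnel constraint reduces to a single linear inequality."""
    return lambda p, p0: p0 * (1 + 6.0 / i) * p

def design_lambda(degrees, p0, grid=20, n_check=50):
    """Brute-force stand-in for the linear program: maximize
    sum(lambda_i / i) over a coarse simplex grid, subject to
    sum_i lambda_i f_i(p, p0) < p on a grid over (0, p0]."""
    charts = {i: toy_chart(i) for i in degrees}
    pts = [p0 * k / n_check for k in range(1, n_check + 1)]
    best, best_obj = None, -1.0
    for combo in itertools.product(range(grid + 1), repeat=len(degrees) - 1):
        if sum(combo) > grid:
            continue
        lam = [c / grid for c in combo] + [1.0 - sum(combo) / grid]
        ok = all(sum(l * charts[i](p, p0) for l, i in zip(lam, degrees)) < p
                 for p in pts)
        if ok:
            obj = sum(l / i for l, i in zip(lam, degrees))
            if obj > best_obj:
                best, best_obj = lam, obj
    return best, best_obj

lam, obj = design_lambda([2, 3, 10], p0=0.4)
```

With these toy charts the constraint reduces to 0.4·(1 + 6·Σλi/i) < 1, i.e., the objective is capped just below 0.25, and the search returns the best grid point under that cap.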


We have designed a number of irregular codes with a variety of rates using this Gaussian approximation. The results are presented in Table 10.1. In the design of the presented codes, we have avoided any equality vertices or check vertices with degrees higher than 40. Table 10.1 suggests that for code rates above 0.25, the method is quite successful. For rates greater than 0.85, getting close to capacity requires high-degree check vertices. To show that our method can actually handle high-rate codes, we designed a rate-0.9497 code, which uses check vertices of degree 120 but no equality vertices of degree greater than 40. The degree sequence for this code is λ = {λ2 = 0.1029, λ3 = 0.1823, λ6 = 0.1697, λ7 = 0.0008, λ9 = 0.1094, λ15 = 0.0240, λ35 = 0.2576, λ40 = 0.1533}. The threshold of this code is at σ = 0.4462, which means it has a gap of only 0.0340 dB from the Shannon limit.
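The quoted rate can be checked directly from this degree sequence via the rate bound of section 10.5, with all check vertices of degree 120 (a quick sanity check, ours):

```python
import math

# Degree sequence of the rate-0.9497 code quoted in the text.
lam = {2: 0.1029, 3: 0.1823, 6: 0.1697, 7: 0.0008, 9: 0.1094,
       15: 0.0240, 35: 0.2576, 40: 0.1533}
assert abs(sum(lam.values()) - 1.0) < 1e-9   # a valid edge distribution

# All check vertices have degree 120, so sum_j rho_j / j = 1/120, and
# rate >= 1 - (sum_j rho_j / j) / (sum_i lambda_i / i).
rate = 1 - (1.0 / 120) / sum(l / i for i, l in lam.items())
assert abs(rate - 0.9497) < 0.001

# Eb/N0 at the quoted threshold sigma = 0.4462 (comes out near 4.2 dB):
ebn0_db = 10 * math.log10(1 / (2 * rate * 0.4462 ** 2))
assert 4.0 < ebn0_db < 4.5
```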

10.6

Conclusions and Future Prospects

Low-density parity-check (LDPC) codes are a flexible family of codes with a simple decoding algorithm. As we have shown in this chapter, the structure of the codes can be fine-tuned to allow decoding even at channel parameters that approach the Shannon limit. Because of their excellent properties, LDPC codes have attracted enormous research interest, and there is now a large body of literature that we have not attempted to survey in this chapter. Low-density parity-check codes are now beginning to emerge in a variety of communication standards, including, e.g., the DVB-S2 standard for digital video broadcasting by satellite (Eroz et al., 2004). An important research direction is that of finding LDPC codes with relatively short block lengths (a few thousand bits, say) that still have excellent iterative decoding performance. For further reading in the area of LDPC codes, we recommend the books of Lin and Costello (2004) and MacKay (2003) and the survey article by Richardson and Urbanke (2003) as excellent starting points.

11

Turbo Processing

Claude Berrou, Charlotte Langlais, and Fabrice Seguin

Turbo processing is a way of processing data in communication receivers so that no information stemming from the channel is wasted. The first application of the turbo principle was in error correction coding, an essential function in modern telecommunications systems. A novel structure of concatenated codes, nicknamed turbo codes, was devised in the early 1990s to benefit from the turbo principle. Turbo codes, whose performance is near-optimal with respect to the theoretical limits calculated by Shannon, have since been adopted in several telecommunications standards. The turbo principle, also called the message-passing principle or belief propagation, can be exploited in signal-processing tasks other than error correction, such as detection and equalization. More generally, whenever separate processors work on data sets that are linked in some way, the turbo principle may improve the result of the global processing. In digital circuits, the turbo technique is based on an iterative procedure, with operations repeated in all the processors considered. Another, more natural possibility is the use of analog circuits, in which the exchange of information between the different processors is continuous.

11.1

Introduction

Error correction coding, also known as channel coding, is a fundamental function in modern telecommunications systems. Its purpose is to make these systems work even in tough physical conditions, due for instance to a low received signal level, interference, or fading. Another important field of application for error correction coding is mass storage (computer hard disks, CD and DVD-ROM, etc.), where the ever-continuing miniaturization of the elementary storage pattern makes reading the information increasingly difficult. Error correction is a digital technique; that is, the information message to be protected is composed of a certain number of digits drawn from a finite alphabet. Most often, this alphabet is binary, with logical elements or bits 0 or 1. Then, error

308

Turbo Processing

correction coding, in the so-called systematic way, involves adding some number of redundant logical elements to the original message, the whole being called a codeword. The mathematical law used to calculate the redundant part of the codeword is specific to a given code. Besides this mathematical law, the main parameters of a code are as follows.

The code rate: the ratio between the number of bits in the original message and in the codeword. Depending on the application, the code rate may be as low as 1/6 or as high as 9/10.

The minimum Hamming distance (MHD): the minimum number of bits that differ between one codeword and any other. The higher the MHD, the more robust the associated decoder when confronted with multiple errors.

The ability of the decoder to exploit soft (analog) values from the demodulator, instead of hard (binary) values. A soft value (that is, the sign and the magnitude) carries more information than a hard value (only the sign).

The complexity and the latency of the decoder.

Since the seminal work by Shannon on the potential of channel coding (Shannon, 1948), many codes have been devised and used in practical systems. The state of the art in the early 1990s was the coding construction depicted in fig. 11.1. This is called "standard concatenation" and is made up of a serial combination of a Reed-Solomon (RS) encoder, a symbol interleaver, and a convolutional encoder. The corresponding decoder (fig. 11.1) is composed of a Viterbi decoder, a symbol de-interleaver, and an RS decoder. This concatenated scheme works nicely because the Viterbi decoder can easily benefit from soft samples coming from the demodulator, while the RS decoder can withstand residual bursty errors that may come from the Viterbi decoder. Nevertheless, although the MHD of the concatenated code is very large, the decoder does not provide optimal error correction. Roughly, the performance is 3 or 4 dB from the theoretical limit. Where does this loss come from?

Figure 11.1 Standard concatenation of a Reed-Solomon encoder and a convolutional encoder (a), and the associated decoder (b).


The inner Viterbi decoder, processing analog-like input samples, is locally optimum, that is, it derives the maximum beneﬁt from the redundancy added by the convolutional encoder. The outer RS decoder, which is also locally optimum, beneﬁts from the work of the inner decoder and from the redundancy added by the RS encoder. Both decoders are optimum, each of them separately, but their association is not optimal: the Viterbi decoder does not exploit the redundancy oﬀered by the RS codeword. A global decoder, which would use the whole redundancy in one processing step, would be extremely complex and is not realistic. The way to contemplate near-optimal decoding of the standard concatenated scheme is to enable the inner decoder to beneﬁt from the work done by the outer decoder, using a kind of feedback. This observation is at the root of turbo processing, “turbo” being used to refer to the way the power of a turbo engine is increased by the reuse of its exhaust gases. This being said, the concatenated decoding scheme of ﬁgure 11.1 does not easily lend itself to such a feedback principle. The code has to be devised in order to enable bidirectional exchanges between the two component decoders.

11.2

Random Coding and Recursive Systematic Convolutional (RSC) Codes

The theoretical limits were calculated by Shannon on the basis of random coding, which has since remained the reference in matters of error correction. The systematic random encoding of a message having k information bits and producing a codeword with n bits may be achieved in the following way. As a first step, once and for all, k binary words of n − k bits are drawn at random and memorized. These k words constitute the basis of a vector space, the ith random word (1 ≤ i ≤ k) being associated with the information message that contains only zeros (the "all-zero" message) except in the ith place. The redundant part of any codeword is obtained by calculating the modulo-2 sum of the random words whose address i is such that the ith bit of the original message is one. The coding rate is R = k/n. This very simple construction leads to a very large MHD. Because two codewords differ in at least one information bit, and thanks to the random nature of the redundant part, the mean distance is 1 + (n − k)/2. Nevertheless, the MHD of the code being a random value, its different realizations may be less than this mean value. A realistic approximation of the actual MHD is (n − k)/4. Such large values, for instance 100 for n = 2k = 800, are unreachable with practical codes. Fortunately, these large MHDs are not necessary for common communications systems (Berrou et al., 2003). The device depicted in fig. 11.2 is called a recursive systematic convolutional (RSC) encoder; its length, the number of memory elements, is denoted ν. This encoder is based on the principle and the random features of the linear feedback register (LFR), also called a pseudo-random generator. When an appropriate set of feedback taps is chosen, the period P of the LFR is maximum and equal


to P = 2^ν − 1. For sufficiently large values of ν, P can be much higher than any length of message to process, by several orders of magnitude. The RSC code can then be regarded as a quasi-perfect random code. Values of ν larger than 30 or 40 would be sufficient to make the RSC code equivalent to a random code. The message d = {d0, ..., di, ..., dk−1} to be encoded feeds the LFR input and is transmitted as symbols X, the systematic part of the codeword. The redundant or parity part is provided by the modulo-2 summation of certain binary values from the register. Using the D (delay) formalism, the redundant symbols Y are expressed as

Y(D) = (G2(D)/G1(D)) d(D),   (11.1)

where

G1(D) = 1 + Σ_{j=1}^{ν−1} G1^(j) D^j + D^ν   and   G2(D) = 1 + Σ_{j=1}^{ν−1} G2^(j) D^j + D^ν   (11.2)

are the polynomials defining the taps for recursivity and parity construction. G1^(j) (resp. G2^(j)) is equal to 1 if the register tap at level j (1 ≤ j ≤ ν − 1) is used in the construction of recursivity (resp. parity), and 0 otherwise. G1(D) and G2(D) are generally given in octal form. For instance, 1 + D^3 + D^4 is referred to as polynomial 23. Convolutional encoding exhibits a side effect at the end of the coding process, which may be detrimental to decoding performance for the last bits of the message. In order to take its decision, the decoder uses information carried by current, past, and subsequent symbols, and the subsequent symbols do not exist at the end of the block. This point is known as the termination problem. Among the several solutions to this problem, the classical one consists in adding

Figure 11.2 Recursive systematic convolutional (RSC) encoder, with length ν. The input d(D) is output unchanged as the systematic part X(D) = d(D); the redundant part Y(D) is formed from the register taps through G1(D) and G2(D).
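To make the structure of fig. 11.2 concrete, here is a minimal sketch of the parity recursion Y(D) = (G2(D)/G1(D)) d(D), computed through the intermediate sequence w = d/G1. The taps used (G1 = 1 + D + D^3, G2 = 1 + D^2 + D^3) are illustrative assumptions, not a recommendation from the chapter:

```python
def rsc_encode(bits, g1, g2):
    """Parity sequence of a recursive systematic convolutional code.

    g1, g2 are coefficient lists [c0, c1, ..., c_nu] of G1(D), G2(D),
    with c0 = c_nu = 1.  Implements Y(D) = (G2(D)/G1(D)) d(D) via the
    intermediate sequence w = d / G1.
    """
    nu = len(g1) - 1
    past = [0] * nu                      # w_{i-1}, ..., w_{i-nu}
    parity = []
    for d in bits:
        wi = d
        for j in range(1, nu + 1):       # feedback taps of G1
            wi ^= g1[j] & past[j - 1]
        y = g2[0] & wi
        for j in range(1, nu + 1):       # forward taps of G2
            y ^= g2[j] & past[j - 1]
        parity.append(y)
        past = [wi] + past[:-1]
    return parity, past                  # parity bits and final register state

g1 = [1, 1, 0, 1]                        # G1(D) = 1 + D + D^3 (period 7)
g2 = [1, 0, 1, 1]                        # G2(D) = 1 + D^2 + D^3 (assumed)
```

With these taps, the weight-2 input 1 + D^7 (where 7 = 2^3 − 1 is the maximal period P) returns the register to the all-zero state, while a single impulse leaves it nonzero.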


dummy information bits, called tail bits, to make the encoder return to the "all-zero" state. Another, more elegant technique is tail-biting, also called circular termination (Weiss et al., 2001). This involves allowing any state as the initial state and encoding the sequence, containing k information bits, so that the final state of the encoder register equals the initial state. The trellis of the code (the temporal representation of the possible states of the encoder, from time i = 0 to i = k − 1) can then be regarded as a circle. In what follows, we will refer to circular recursive systematic convolutional (CRSC) codes, the circular version of RSC codes. Thus, without having to pay for any additional information, and therefore without impairing spectral efficiency, the convolutional code has become a real block code, in which, for each time i, the past is also the future, and vice versa. RSC or CRSC codes, like classical nonrecursive convolutional codes, are linear codes. Thanks to the linearity property, the code characteristics are expressed with respect to the all-zero sequence. In this case, any nonzero sequence d(D), accompanied by redundancy Y(D), represents a possible error pattern for the coding/decoding system, a one meaning a binary error. Equation 11.1 indicates that only a fraction of sequences d(D), those that are multiples of G1(D), lead to short-length redundancy. We call these particular sequences return-to-zero (RTZ) sequences (Podemski et al., 1995), because they force the encoder, if initialized in state 0, to return to this state after the encoding of d(D). In what follows, we will be interested only in RTZ patterns, assuming that the decoder will never decide in favor of a sequence whose distance from the all-zero sequence is very large. The fraction of sequences d(D) that are RTZ is exactly

p(RTZ) = 2^−ν,   (11.3)

because the encoder has 2^ν possible states and an RTZ sequence finishes systematically at state 0. The shortest RTZ sequence is G1(D) or a shifted version of it. Any RTZ sequence, in the block of k bits with circular termination, may be expressed as

RTZ(D) = G1(D) Σ_{i=0}^{k−1} a_i D^i   mod (1 + D^k),   (11.4)

where a_i takes the value 0 or 1. The operation modulo (1 + D^k) transforms every monomial D^x in the resulting product into D^(x mod k), for any integer x, so that all exponents are between 0 and k − 1. The minimum number of 1's belonging to an RTZ sequence is two. This is because G1(D) is a polynomial with at least two nonzero terms, and equation 11.4 then guarantees that RTZ(D) also has at least two nonzero terms. The number of 1's in a particular RTZ sequence is called the input weight and is denoted w. We then have wmin = 2 for RSC codes, and the RTZ sequences with weight 2 are of


the general form

RTZ_{w=2}(D) = D^τ (1 + D^{pP})   mod (1 + D^k),   (11.5)

where τ is the starting time, p is any positive integer, and P is the period of the encoder, as introduced previously. RTZ sequences with odd weight may or may not exist, depending on the expression of G1(D). RTZ sequences with even weight always exist, in particular of the form

RTZ_{w=2l}(D) = Σ_{j=0}^{l−1} D^{τ_j} (1 + D^{p_j P})   mod (1 + D^k),   (11.6)

that is, as a combination of any l weight-2 RTZ sequences, with τ_j and p_j any positive integers. This sort of composite RTZ sequence has to be considered closely when trying to design good permutations for turbo codes, as explained in section 11.5. What we are searching for is a very long RSC code, having a large period P, in order to take advantage of quasi-perfect random properties. But such codes cannot be decoded, because of the too-large number of states to consider and process. That is why other, slightly more sophisticated forms of random-like codes have to be devised.
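Equation 11.3 can be checked exhaustively for a small example. The sketch below assumes G1(D) = 1 + D + D^3 (so ν = 3; the taps are an illustrative choice) and counts, over all 2^10 input sequences of length k = 10, those that leave the feedback register in the zero state; by linearity of the state map, exactly 2^(k−ν) of them do:

```python
from itertools import product

def final_state(bits, g1):
    # state of the feedback (recursion) register after encoding `bits`,
    # starting from the all-zero state
    nu = len(g1) - 1
    past = [0] * nu
    for d in bits:
        wi = d
        for j in range(1, nu + 1):
            wi ^= g1[j] & past[j - 1]   # feedback taps of G1
        past = [wi] + past[:-1]
    return past

g1 = [1, 1, 0, 1]            # G1(D) = 1 + D + D^3, nu = 3 (assumed example)
k = 10
rtz = sum(final_state(bits, g1) == [0, 0, 0]
          for bits in product((0, 1), repeat=k))
# rtz == 2**(k - 3) == 128, i.e., p(RTZ) = 2**-3, as in eq. 11.3
```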

11.3

Turbo Codes

In the previous section, we saw that the probability that any given sequence is an RTZ sequence for a CRSC encoder is 1/2^ν. Now, if we encode this sequence N times (fig. 11.3 with ν = 3), each time in a different order, drawn at random by permutation Πj (1 ≤ j ≤ N) (the first order may be the natural order), the probability that the sequence remains RTZ for all encoders is lowered to 1/2^{Nν}. For example, with ν = 3 and N = 7, this probability is less than 10^−6. This technique is known as multiple parallel concatenation of CRSC codes (Berrou et al., 1999). Of course, to deal with realistic coding rates (around 1/2), some puncturing has to be performed; that is, not all the redundant symbols Yj are used to form the codeword. For instance, if R = 1/2, each component encoder provides only k/N parity bits. If the message is not RTZ after permutation Πj, the average weight of sequence {Yj} is k/2N (assuming that, statistically, every other bit in {Yj} is 1). This guarantees a large distance when at least one permuted sequence is not RTZ. Fortunately, it is possible to obtain quasi-optimum performance with only two encodings (fig. 11.3), and this is a classical turbo code (Berrou et al., 1993). For bit error rates (BERs) higher than around 10^−5, the permutation may still be drawn at random but, for lower rates, a particular effort has to be made in its design. The way the permutation is devised fixes the MHD dmin of the turbo code, and therefore the achievable asymptotic gain Ga offered by the coding scheme, according to the

Figure 11.3 (a) In this multiple parallel concatenation of circular recursive systematic convolutional (CRSC) codes, the block containing k information bits is encoded N times. The probability that the sequence remains of the return-to-zero (RTZ) type after the N permutations, drawn at random (except the first one), is very low. The properties of this multiconcatenated code are very close to those of random codes. (b) The number of encodings can be limited to two, provided that permutation Π is judiciously devised. This is a classical turbo code.

well-known approximation

Ga ≈ 10 log10(R dmin).   (11.7)
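As a numeric illustration of eq. 11.7 (the values of R and dmin below are arbitrary examples, not taken from a standardized code):

```python
import math

def asymptotic_gain_db(rate, dmin):
    # Ga ~ 10 log10(R * dmin)   (eq. 11.7)
    return 10.0 * math.log10(rate * dmin)

# doubling dmin at a fixed rate buys about 3 dB of asymptotic gain
ga_a = asymptotic_gain_db(0.5, 20)   # 10.0 dB
ga_b = asymptotic_gain_db(0.5, 40)   # ~13.0 dB
```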

The natural coding rate of a turbo code is R = 1/3. In order to obtain higher rates, certain redundant symbols are punctured. For instance, Y1 and Y2 symbols are transmitted alternately to achieve R = 1/2. A particular turbo code is defined by the following parameters.

m, the number of bits in the input words. Applications known so far consider binary (m = 1) and double-binary (m = 2) input words (see section 11.6).

The component codes C1 and C2 (code memory ν, recursivity and redundancy polynomials). The values of ν are 3 or 4 in practice, and the polynomials are generally those recognized as the best for simple unidimensional convolutional


coding, that is, (15,13) for ν = 3 and (23,35) for ν = 4, or their symmetric forms.

The permutation function, which plays a decisive role when the target BER is lower than about 10^−5. Above this value, the permutation may follow any law, provided of course that it respects at least the scattering property (the permutation may be the regular one, for instance).

The puncturing pattern. This has to be as regular as possible, as for simple convolutional codes. In addition to this rule, the puncturing pattern is defined in close relationship with the permutation function when very low error rates are sought.

11.4

Turbo Decoding

Decoding a composite code by a single global process is not possible in practice, because of the tremendous number of states to consider. A joint probabilistic process by the decoders of C1 and C2 has to be elaborated, following a kind of divide-and-conquer strategy. Because of local latency constraints, this joint process is worked out in an iterative manner in a digital circuit. Analog versions of the turbo decoder are also considered, offering several advantages, as explained in section 11.7. Turbo decoding relies on the following fundamental criterion, which is applicable to all so-called message-passing or belief-propagation algorithms (McEliece et al., 1998): when several probabilistic machines work together on the estimation of a common set of symbols, all the machines have to give the same decision, with the same probability, about each symbol, as a single (global) decoder would. To make the composite decoder satisfy this criterion, the structure of fig. 11.4 is adopted. The double loop enables both component decoders to benefit from the whole redundancy. The components are soft-in/soft-out (SISO) decoders, the permutation Π, and the inverse permutation Π^−1 memories. The node variables of the decoder are logarithms of likelihood ratios, simply called log-likelihood ratios (LLRs). The LLR related to a particular binary datum di (0 ≤ i ≤ k − 1) is defined, apart from a multiplying factor, as

LLR(di) = ln [ Pr(di = 1) / Pr(di = 0) ].   (11.8)

The role of a SISO decoder is to process an input LLR and, thanks to local redundancy (i.e., y1 for DEC1, y2 for DEC2), to try to improve it. The output LLR of a SISO decoder, for a binary datum, may simply be written as

LLRout(di) = LLRin(di) + z(di),   (11.9)

where z(di) is the extrinsic information about di, provided by the decoder. If this


Figure 11.4 An 8-state turbo code (a) and its associated decoder (b), with a basic structure assuming no delay processing.

works properly, z(di) is most of the time negative if di = 0, and positive if di = 1. The composite decoder is constructed in such a way that only extrinsic terms are passed from one component decoder to the other. The input LLR to a particular decoder is composed of the sum of two terms: the information symbols (x) stemming from the channel, also called the intrinsic values, and the extrinsic terms (z) provided by the other decoder, which serve as a priori information. The intrinsic symbols are inputs common to both decoders, which is why the extrinsic information does not contain them. In addition, the outgoing extrinsic information does not include the incoming extrinsic information, in order to minimize correlation effects in the loop. The subtractors in fig. 11.4 are used to remove intrinsic and extrinsic information from the feedback loops. Nevertheless, because the blocks have finite length, correlation effects between extrinsic and intrinsic values may exist and degrade the decoding performance. The practical course of operation is as follows.

Step 1: process the data peculiar to one code, say C2 (x and y2), with decoder DEC2, and store the extrinsic pieces of information (z2) resulting from the decoding in a memory. If data are missing because of puncturing, the corresponding values are set to analog 0 (the neutral value).

Step 2: process the data specific to C1 (x, de-interleaved z2, and y1) with decoder DEC1, and store the extrinsic pieces of information (z1) in a memory. By properly organizing the read/write instructions, the same memory can be used for storing both z1 and z2. Steps 1 and 2 make up the first iteration.

Step 3: process C2 again, now taking interleaved z1 into account, and store the updated values of z2. And so on.
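The extrinsic-information bookkeeping in these steps can be illustrated with a toy example: a length-3 repetition code, whose (trivial) SISO rule is that the extrinsic LLR for one replica is the sum of the other replicas' intrinsic LLRs. This is only an illustration of eqs. 11.8 and 11.9, not the trellis-based SISO used in an actual turbo decoder, and the probabilities are arbitrary:

```python
import math

def llr(p1):
    # eq. 11.8: LLR of a bit with Pr(d = 1) = p1
    return math.log(p1 / (1.0 - p1))

# three independent noisy observations (intrinsic LLRs) of the same bit
intrinsic = [llr(0.8), llr(0.3), llr(0.9)]

# extrinsic information for replica i: everything known about the bit
# EXCEPT its own intrinsic value
extrinsic = [sum(intrinsic) - x for x in intrinsic]

# eq. 11.9: output LLR = intrinsic + extrinsic; all three replicas end up
# with the same LLR (hence the same decision, with the same probability),
# as the message-passing criterion requires
outputs = [x + z for x, z in zip(intrinsic, extrinsic)]
```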


The process ends after a preestablished number of iterations, or after the decoded block has been estimated as correct according to some stopping criterion (see Matache et al., 2000, for possible stopping rules). The typical number of iterations for the decoding of convolutional turbo codes is four to ten, depending on the constraints relating to complexity, power consumption, and latency. According to the structure of the decoder, after p iterations, the output of DEC1 is

LLRout1,p(di) = (x + z2,p−1(di)) + z1,p(di),

where zu,p(di) is the extrinsic piece of information about di yielded by decoder u after iteration p, and the output of DEC2 is

LLRout2,p(di) = (x + z1,p−1(di)) + z2,p(di).

If the iterative process converges toward fixed points, z1,p(di) − z1,p−1(di) and z2,p(di) − z2,p−1(di) both tend to zero as p goes to infinity. Therefore, from the equations above, both LLRs become equal, which fulfills the fundamental condition of equal probabilities provided by the component decoders for each datum di. As for the proof of convergence itself, one can refer to various papers dealing with the theoretical aspects of the subject, such as Weiss and Freeman (2001) and Duan and Rimoldi (2001). An important tool for the analysis of convergence is the EXIT chart (ten Brink, 2001). EXIT, which stands for extrinsic information transfer, treats each SISO decoder in the turbo decoder, in a statistical way, as a nonlinear transfer function of extrinsic information. Turbo decoding is not optimal. This is because the iterative process has to begin, during the first half-iteration, with only part of the redundant information available (either y1 or y2). Furthermore, correlation effects between the noises affecting intrinsic and extrinsic terms may be detrimental. Fortunately, the loss due to suboptimality is small, of the order of a few tenths of a dB.
There are two families of SISO algorithms: those based on the Viterbi algorithm (Battail, 1987; Hagenauer and Hoeher, 1989), which can be used for high-throughput continuous-stream applications, and those based on the APP (a posteriori probability, also called MAP or BCJR) algorithm (Bahl et al., 1974) or its simplified derived versions (Robertson et al., 1997) for block decoding. If the full APP algorithm is chosen, it is better for the extrinsic information to be expressed as probabilities instead of LLRs, which avoids calculating a useless variance for the extrinsic terms. In practice, depending on the kind of SISO algorithm chosen, some tuning operations (multiplying, limiting) on the extrinsic information are added to the basic structure to ensure stability and convergence within a small number of iterations.

11.5

Permutation

In a turbo code, the permutation plays a double role:

1. It must ensure maximal scattering or spreading of adjacent bits, in order to minimize the correlation effects in the message passing between the two component decoders.

2. It contributes greatly to the value of the MHD. Between a badly designed and a well-designed permutation, the MHD may differ by a factor of 2 or 3.

Let us consider the binary turbo code represented in fig. 11.4, with the permutation acting on k bits. The worst permutation we can imagine is the identity permutation, which minimizes the coding diversity (i.e., Y1 = Y2). On the other hand, the best permutation that could be used, but which probably does not exist (Svirid, 1995), would make the concatenated code equivalent to a sequential machine whose irreducible number of states is 2^{k+6}. There are actually k + 6 binary storage elements in the structure: k in the permutation memory and 6 in the encoders. Regarding this machine as a convolutional code would give a very long code and very large minimum distances, for usual values of k. From the worst to the best of permutations, there is a great choice among the k! possible combinations, and we still lack a sound theory on the subject. Nevertheless, good permutations have already been designed, using pragmatic approaches, to elaborate normalized turbo codes.

11.5.1

Regular Permutation

Maximum spreading (criterion 1 above) is achieved by regular permutation. For a long time, regular permutation was almost exclusively seen as rectangular (linewise writing and columnwise reading in an ad hoc memory, fig. 11.5). When CRSC codes are used as the component codes of a turbo code, circular permutation, based on congruence properties, is more appropriate. Circular permutation, for blocks having k information bits (fig. 11.5), is devised as follows. After the data are written in a linear memory, with address i (0 ≤ i ≤ k − 1), the block is likened to a circle, both extremities of the block (i = 0 and i = k − 1) then being contiguous. The data are read out such that the jth datum read was written at the position i given by

i = Π(j) = P·j   mod k,

where the skip value P is an integer relatively prime with k. We define the total spatial distance (or span) S(j1, j2) as the sum of the two spatial distances, before and after permutation, for a given pair of positions j1 and j2:

S(j1, j2) = f(j1, j2) + f(Π(j1), Π(j2)),   (11.10)


Figure 11.5 Rectangular (a) and circular (b) permutation. In (a), the k = M·N data are written linewise over M rows and N columns and read columnwise; in (b), they are read around the circle with skip P.

where

f(u, v) = min{|u − v|, k − |u − v|}.   (11.11)

Finally, we denote by Smin the minimum value of S(j1, j2) over all possible pairs j1 and j2:

Smin = min_{j1,j2} {S(j1, j2)}.   (11.12)

With regular permutation, the value of P that maximizes Smin (Boutillon and Gnaedig, 2005) is

P0 = √(2k),   (11.13)

with the condition

k = P0/2   mod P0,   (11.14)

which gives

Smin = P0 = √(2k).   (11.15)
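A brute-force check of eqs. 11.10 to 11.15 can be sketched as follows (k = 50 is an arbitrary small block size chosen for illustration):

```python
from math import gcd, isqrt

def f(u, v, k):
    # spatial distance on the circle of size k   (eq. 11.11)
    d = abs(u - v)
    return min(d, k - d)

def smin(k, P):
    # minimum total span of the circular permutation i = P*j mod k  (eq. 11.12)
    return min(f(j1, j2, k) + f(P * j1 % k, P * j2 % k, k)
               for j1 in range(k) for j2 in range(j1 + 1, k))

k = 50                                   # so sqrt(2k) = 10
best = max(smin(k, P) for P in range(1, k) if gcd(P, k) == 1)
# the best achievable Smin over all valid skips equals sqrt(2k) = 10,
# attained for instance by P = 9
```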

In practice, to comply as far as possible with the criterion of maximum total spatial distance, P is chosen as an integer close to P0 and prime with k.

11.5.2

Real Permutations

Let us recall that a decoder of an RSC code is only sensitive to error sequences of the RTZ type, as introduced in section 11.2. Real permutations therefore have to satisfy, as far as possible, the following ideal rule: if a sequence is RTZ before permutation, then it is not RTZ after permutation, and vice versa.


In this case, at least one of the component decoders in fig. 11.4 is able to recover from the errors. But the previous rule is impossible to comply with fully, and a more realistic target is this one: if a sequence is a short RTZ sequence before permutation, then either it is not RTZ or it is a long RTZ sequence after permutation, and vice versa. The dilemma in the design of a good permutation lies in the need to satisfy this practical rule for two distinct classes of codewords, which require conflicting treatment. The first class contains all nonzero codewords (again with reference to the "all-zero" codeword) that are not combinations of simple RTZ sequences; a good permutation for this class is as regular as possible, which ensures maximum spreading. This type of sequence has low input weight (w ≤ 3).

Figure 11.6 Some possible RTZ (return-to-zero) sequences for both encoders C1 and C2, with G1(D) = 1 + D + D^3 (period L = 7): (a) with input weight w = 2; (b) with w = 3; (c) with w = 6 or 9.

The second class encompasses all codewords that are combinations of simple RTZ sequences; nonuniformity (controlled disorder) has to be introduced into the permutation function to obtain a large MHD for these. Figure 11.6 illustrates the situation, showing the example of a rate-1/3 turbo code using binary component encoders with code memory ν = 3 and periodicity L = 2^ν − 1 = 7. For the sake


of simplicity, the block of k bits is organized as a rectangle with M rows and N columns (M ≈ N ≈ √k). Regular permutation is used; that is, data are written linewise and read columnwise. Figure 11.6a depicts a situation where encoder C1 (the horizontal one) is fed with an RTZ sequence of input weight w = 2. Redundancy Y1 delivered by this encoder is poor, but redundancy Y2 produced by encoder C2 (the vertical one) is very informative for this pattern, which is also an RTZ sequence but whose span is 7M instead of 7. The associated MHD would be around 7M/2, which is a large value for typical sizes k. With respect to this w = 2 case, the code is said to be "good" because dmin tends to infinity when k tends to infinity. Figure 11.6b deals with a weight-3 RTZ sequence. Again, whereas the contribution of redundancy Y1 is not high for this pattern, redundancy Y2 gives relevant information over a large span, of length 3M. The conclusions are the same as in the previous case. Figure 11.6c shows two examples of sequences with weights w = 6 and w = 9, which are RTZ sequences for encoder C1 as well as for encoder C2. They are obtained by combining two or three minimal-length RTZ sequences. The weight of the redundant bits is limited and depends neither on M nor on N. These patterns are typical of the codewords that limit the MHD of a turbo code when a regular permutation is used. In order to "break" such rectangular patterns, some disorder has to be introduced into the permutation rule, while ensuring that the good properties of regular permutation with respect to low weights are not lost. This is the crucial problem in the search for good permutations, and it has not yet found a definitive answer. Nevertheless, some good permutations have already been devised for recent applications, e.g., Technical Specification Groups, IMT-2000 (3GPP, 1999; TIA/EIA/IS, 1999), and DVB.
Further details about non-regular permutations can be found in Crozier and Guinand (2003) and Berrou et al. (2004).
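The regular rectangular permutation described above can be sketched in a few lines. The helper below is our own illustration (not code from any standard): it writes the k = M·N bits linewise and reads them columnwise, then checks that two bits 7 positions apart in a row, the span of a minimal RTZ sequence for C1, are read 7M positions apart by C2.

```python
import numpy as np

def regular_permutation(M, N):
    """Regular rectangular permutation: write the k = M*N bits
    linewise (row by row), read them columnwise.  Returns, for each
    read position, the index of the bit in write order."""
    idx = np.arange(M * N).reshape(M, N)   # write order, row by row
    return idx.T.reshape(-1)               # read column by column

M, N = 10, 10                              # k = 100, M ≈ N ≈ sqrt(k)
pi = regular_permutation(M, N)

# The permutation is a bijection on {0, ..., k-1}.
assert sorted(pi) == list(range(M * N))

# A weight-2 RTZ pattern for C1: two bits 7 positions apart in the
# same row (span 7 for the horizontal encoder) ...
i, j = 0, 7
# ... is seen by the vertical encoder C2 with a span of 7*M positions,
# because consecutive columns are M read-positions apart.
span_C2 = abs(int(np.where(pi == j)[0][0]) - int(np.where(pi == i)[0][0]))
print(span_C2)   # 7 * M = 70
```

This is exactly why the w = 2 case is "good" under regular permutation: the short span for one encoder becomes a long span for the other.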

11.6  Applications of Turbo Codes

Depending on the constraints imposed by the application (performance, throughput, latency, complexity, etc.), error correction codes can be divided into many families. We will consider here three domains, related to error rates.

Medium error rates (corresponding roughly to 10^-2 > BER > 10^-6 or 1 > FER > 10^-4): this is typically the domain of automatic repeat request (ARQ) systems and is also the most favorable range of error rates for turbo codes. To achieve near-optimum performance, eight-state component codes are sufficient. Figure 11.7 depicts the practical binary turbo code used for these applications and coding rates equal to or lower than 1/2. For higher rates, the double-binary turbo code of figure 11.7 is preferable (Berrou and Jézéquel, 1999). For each of them, one example of performance, in frame error rate (FER) as a function of signal-to-noise ratio Eb/N0, is given in fig. 11.8 (UMTS: R = 1/3, k = 640 and DVB-RCS: R = 2/3, k = 1504).

[Figure 11.7: The four turbo codes used in practice. (a) 8-state binary, (b) 8-state double-binary, both with polynomials 15, 13 (or their symmetric form 13, 15); (c) 16-state binary, (d) 16-state double-binary, both with polynomials 23, 35 (or their symmetric form 31, 27). Binary codes are suitable for rates lower than 1/2, double-binary codes for rates higher than 1/2.]

Low error rates (10^-6 > BER > 10^-11 or 10^-4 > FER > 10^-9): sixteen-state turbo codes perform better than eight-state ones, by about 1 dB, for an FER of 10^-7 (see fig. 11.8). Depending on the sought-for compromise between performance and decoding complexity, one can choose either one or the other. Figures 11.7c and 11.7d depict the 16-state turbo codes that can be used, the binary one for low rates, the double-binary one for high rates. In order to obtain


[Figure 11.8: Some examples of performance, expressed in FER as a function of Eb/N0, achievable with turbo codes on Gaussian channels: QPSK, 8-state binary, R = 1/3, 640 bits; QPSK, 8-state double-binary, R = 2/3, 1504 bits; QPSK, 16-state double-binary, R = 2/3, 1504 bits; 8-PSK, 16-state double-binary, R = 2/3, 1504 bits, pragmatic coded modulation. In all cases: decoding using the Max-Log-APP algorithm with 8 iterations and 4-bit input quantization.]

good results at low error rates, the permutation function must be very carefully devised. An example of performance, provided by the association of 8-PSK (phase-shift keying) modulation and the turbo code of fig. 11.7d, is also plotted in fig. 11.8, for k = 1504 and a spectral efficiency of 2 bit/s/Hz. This association is made according to the pragmatic approach, that is, the codec is the same as the one used for binary modulation. It just requires binary-to-octary conversion at the transmitter side, and the converse at the receiver side.

Very low error rates (10^-11 > BER or 10^-9 > FER): the largest minimum distances that can be obtained from turbo codes, for the time being, are not sufficient to prevent a slope change in the BER(Eb/N0) or FER(Eb/N0) curves at very low error rates. Compared to what is possible today, an increase of MHDs by roughly 25% would be necessary to make turbo codes attractive for this type of application, such as optical transmission or mass storage error protection.

Table 11.1 summarizes the normalized applications of convolutional turbo codes known to date. The first three codes of fig. 11.8 have been chosen for these various systems.
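The pragmatic binary-to-8-ary conversion can be illustrated with a small sketch. The bit-to-symbol labeling and the max-log soft demapper below are generic, assumed choices of our own (natural labeling, unit noise variance), not the exact mapping of any standardized system.

```python
import numpy as np

# Hypothetical natural labeling of 3 bits onto an 8-PSK point; real
# systems choose this mapping carefully (e.g., Gray labeling).
def psk8_modulate(bits):
    b = bits.reshape(-1, 3)
    m = b[:, 0] * 4 + b[:, 1] * 2 + b[:, 2]
    return np.exp(1j * 2 * np.pi * m / 8)

def psk8_soft_demap(y, noise_var):
    """Max-log per-bit LLRs: for each bit position, compare the closest
    constellation point carrying 1 against the closest carrying 0."""
    m = np.arange(8)
    points = np.exp(1j * 2 * np.pi * m / 8)
    labels = ((m[:, None] >> np.array([2, 1, 0])) & 1).astype(bool)  # 8 x 3
    d2 = np.abs(y[:, None] - points[None, :]) ** 2     # squared distances
    llr = np.empty((len(y), 3))
    for b in range(3):
        d0 = d2[:, ~labels[:, b]].min(axis=1)
        d1 = d2[:, labels[:, b]].min(axis=1)
        llr[:, b] = (d0 - d1) / noise_var              # > 0 favors bit = 1
    return llr.reshape(-1)

bits = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1])
y = psk8_modulate(bits)                     # noiseless, for the check
llr = psk8_soft_demap(y, noise_var=1.0)
print((llr > 0).astype(int))                # recovers the bits
```

The binary codec itself is untouched; only this conversion layer is added at each side of the channel, which is the essence of the pragmatic approach.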

Table 11.1  Current known applications of (convolutional) turbo codes

Application                                  Turbo code               Termination   Polynomials      Rates
CCSDS (deep space)                           binary, 16-state         tail bits     23, 33, 25, 37   1/6, 1/4, 1/3, 1/2
UMTS, CDMA2000 (3G mobile)                   binary, 8-state          tail bits     13, 15, 17       1/4, 1/3, 1/2
DVB-RCS (return channel over satellite)      double-binary, 8-state   circular      15, 13           1/3 up to 6/7
DVB-RCT (return channel over terrestrial)    double-binary, 8-state   circular      15, 13           1/2, 3/4
Inmarsat (M4)                                binary, 16-state         no            23, 35           1/2
Eutelsat (Skyplex)                           double-binary, 8-state   circular      15, 13           4/5, 6/7
IEEE 802.16 (WiMAX)                          double-binary, 8-state   circular      15, 13           1/2 up to 7/8

11.7  Analog Turbo Processing

The ever-improving performance of A/D and D/A data converters, coupled with the achievements of Moore's law, has resulted today in fully digital processing. Thus, digital signal processors and other digital programmable devices have superseded the analog circuits traditionally used in telecommunications transceivers. The only blocks that have not yet been totally replaced by digital counterparts are the front-end blocks such as amplifiers and oscillators. One may ask whether this tendency is the best suited to the design of some receiver functions, in particular error correction, which has to cope with a noisy analog signal.

From fig. 11.9, which represents a generic digital transceiver, the following comments about the nature of the signal through the chain can be made. First, the data to send are processed by the channel encoder, which adds redundant bits in order to make these data more resilient to channel noise. The resulting digital signal is next modulated and up-converted using a high-frequency carrier. This results in an analog signal that is transmitted over the channel. As the signal propagates through the channel, it is altered by various noise sources which are themselves analog (for example, weather or electromagnetic conditions, interference). On the receiver side, the corrupted data are amplified and down-converted to baseband. This is again analog processing.

A crucial choice now has to be made. Should the signal be digitized or should it remain analog? Should the receiver take hard decisions or soft ones? The loss of information due to quantization does not plead in favor of a digital solution. This explains the motivation of some laboratories that propose the implementation of channel decoders in analog form


(Hagenauer, 1997a; Lustenberger et al., 1999; Moerz et al., 2000). These studies have not only proved the validity of the concepts but have also shown signiﬁcant gains in terms of speed, power consumption, and silicon area over digital solutions (Gaudet and Gulak, 2003; Moerz et al., 2000).

[Figure 11.9: Generic digital transceiver. Channel coding and modulation operate on digital signals; the power amplifier, noisy channel, low-noise amplifier, demodulator, and down-converter operate on analog signals; channel decoding may be digital or analog.]

Historically, the analog implementation of error correction algorithms began with some research on the soft-output Viterbi algorithm (SOVA) (Battail, 1987; Hagenauer and Hoeher, 1989). Nevertheless, the Viterbi algorithm does not easily lend itself to analog implementation, for which the APP algorithm (Anderson and Hladick, 1998; Bahl et al., 1974) is preferred. It uses only sum/product operators and logarithm/exponential functions. The latter are necessary to convert probabilities into log-likelihood ratios (LLRs), as introduced in section 11.4, and vice versa. LLRs are available at the output of a soft demodulator such as that described in Seguin et al. (2004). If X is a binary random variable and x its observation, the LLR is defined by

LLR(X) = ln [ Pr(X = 1|x) / Pr(X = 0|x) ].   (11.16)

The exponential and natural logarithm functions are readily available from a bipolar junction transistor (BJT) biased in the forward active region. The collector current I_C depends on the base-emitter voltage V_BE according to the well-known relation

I_C ≈ I_S exp(V_BE / U_T),   (11.17)

where I_S is the saturation current and U_T the thermal voltage. When connected as a diode (fig. 11.10a), the transistor produces a voltage V between collector and emitter that depends on the current I as

V ≈ U_T ln(I / I_S).   (11.18)

Associating a current with a probability, and a voltage with an LLR, it is thus


[Figure 11.10: Basic structures for analog decoders. (a) Diode-connected bipolar transistor; (b) Gilbert cell. The input voltages are V_i1 = U_T ln[Pr(X = 1|x)/Pr(X = 0|x)] and V_i2 = U_T ln[Pr(Y = 1|y)/Pr(Y = 0|y)]; the lower-pair collector currents are I_C1 = I_bias Pr(Y = 1|y) and I_C2 = I_bias Pr(Y = 0|y), and the output collector currents are I_C3 = I_bias Pr(Y = 1|y) Pr(X = 1|x), I_C4 = I_bias Pr(Y = 1|y) Pr(X = 0|x), I_C5 = I_bias Pr(Y = 0|y) Pr(X = 0|x), and I_C6 = I_bias Pr(Y = 0|y) Pr(X = 1|x).]

possible to convert LLRs into probabilities and probabilities into LLRs, by using differential structures with transistors and diodes. Currents can also easily be added or multiplied in order to carry out the APP operations. The basic computing block of the decoder is the well-known analog multiplier (fig. 11.10b) called the Gilbert cell (Gilbert, 1968). This cell can convert LLRs into probabilities and multiply them by each other at the same time. The collector currents of transistors Q3, Q4, Q5, Q6, which represent information, can also be summed at a certain node of the circuit. When a diode is added to the collector of each transistor, information can be reconverted into LLR form. Thus the APP algorithm can be implemented using a BJT-based network that directly maps the code trellis.
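The correspondence current ↔ probability and voltage ↔ LLR of eqs. (11.16) through (11.18) can be mimicked numerically. The sketch below (the function names are ours) models the Gilbert cell of fig. 11.10b behaviorally: its four collector currents carry the products Pr(Y|y)·Pr(X|x), scaled by I_bias.

```python
import numpy as np

def llr_to_prob(llr):
    """Inverse of eq. (11.16): Pr(X = 1 | x) from the LLR."""
    return 1.0 / (1.0 + np.exp(-llr))

def prob_to_llr(p1):
    """Eq. (11.16): LLR from Pr(X = 1 | x)."""
    return np.log(p1 / (1.0 - p1))

def gilbert_cell(llr_x, llr_y, i_bias=1.0):
    """Behavioral model of the Gilbert cell of fig. 11.10b: the four
    collector currents carry the products Pr(Y|y) * Pr(X|x)."""
    px1 = llr_to_prob(llr_x); px0 = 1.0 - px1
    py1 = llr_to_prob(llr_y); py0 = 1.0 - py1
    return {"IC3": i_bias * py1 * px1,
            "IC4": i_bias * py1 * px0,
            "IC5": i_bias * py0 * px0,
            "IC6": i_bias * py0 * px1}

cur = gilbert_cell(llr_x=1.2, llr_y=-0.4)
# Currents are probabilities scaled by I_bias, so they sum to I_bias.
print(sum(cur.values()))   # 1.0 (up to rounding)
```

Summing such currents at a node performs the additions of the APP algorithm, and a diode on each collector reconverts the result into LLR form, as described above.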

11.7.1  Implementation of the APP Algorithm

The encoder, at time i (0 ≤ i ≤ k − 1), and the trellis section of an R = 1/2 four-state recursive systematic convolutional (RSC) code are depicted in ﬁg. 11.11. This trellis is assumed to be circular, that is, the states at the beginning and at the end of the encoding are equal. After receiving the LLRs stemming from the channel, the frame containing n = 2k symbols is decoded by feeding the decoder inputs in parallel and by letting the analog network converge toward a stable state. The topology of the circular decoder is given in ﬁg. 11.12. The on-chip network is the direct translation of the APP algorithm (Anderson and Hladick, 1998). It is divided into as many sections as the k information bits to decode. Each section is built from several modules: a Γ module to compute the branch metrics,

[Figure 11.11: A 4-state RSC encoder with code rate R = 1/2. The input information symbol X is transmitted together with the redundant symbol (parity bit) Y. A trellis section is also shown, whose branches are labeled d/XY with the input bit and the encoded symbols.]

an A module to compute the forward metrics, a B module to compute the backward metrics, and a Dec module that takes a final hard decision on the value of the information bit. The input samples LLR(X_i) and LLR(Y_i) are associated with the ith couple of transmitted symbols X_i and Y_i. The forward and backward metrics α_i and β_{i+1} are yielded, for each branch of the trellis, by the adjacent trellis sections. The outputs of the section are the metrics α_{i+1} and β_i, which are used as inputs to the adjacent sections, as well as the hard decision d̂_i. Moreover, in order to implement a turbo decoder, an additional module, Extr, is required to compute the extrinsic information LLR_ext(X_i). These values are then used as inputs for the Γ module of another APP decoder.

As examples, the Γ and A modules of the four-state decoder are illustrated in fig. 11.13. The branch metrics γ are directly obtained by using the outputs of the Gilbert cell fed with the LLRs. Let α_i(s) be the forward metric associated with the state s of the ith section. Let γ_i(s', s) be the branch metric between any state s' of the ith section and one linked state s of the (i + 1)st section. Then the four forward metrics of each section i between 0 and k − 1 are recursively computed as follows:

α_{i+1}(s) = Σ_{s'=0}^{3} α_i(s') γ_i(s', s).   (11.19)

Therefore, the Gilbert cell has to be extended to perform the four multiplications and additions, as shown in fig. 11.13. Note that the structures presented require relatively few transistors. This leads to low silicon area and low power consumption, and opens up the way to fully parallel processing. This latter property is essential for the design of high-speed decoders.
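The forward recursion (11.19) is easy to check numerically. The sketch below builds a 4-state RSC trellis (the feedback and parity taps, like the exponential branch-metric form, are assumed illustrative choices, not necessarily those of fig. 11.11) and runs the α recursion with per-section normalization, the digital counterpart of what the A modules compute continuously.

```python
import numpy as np

def build_trellis():
    """4-state RSC trellis.  The feedback/parity taps below are one
    common choice for a 4-state code and are only illustrative.
    Returns a list of (s_prev, d, x, y, s_next) branches."""
    branches = []
    for s in range(4):
        s1, s2 = (s >> 1) & 1, s & 1
        for d in (0, 1):
            a = d ^ s1 ^ s2              # recursive feedback bit
            x, y = d, a ^ s2             # systematic and parity bits
            branches.append((s, d, x, y, (a << 1) | s1))
    return branches

def forward_metrics(llr_x, llr_y, branches):
    """Eq. (11.19): alpha_{i+1}(s) = sum_{s'} alpha_i(s') gamma_i(s', s),
    with gamma_i = exp(x~ * LLR(X_i)/2 + y~ * LLR(Y_i)/2), the bits x, y
    mapped to +/-1, and a normalization at each section."""
    k = len(llr_x)
    alpha = np.zeros((k + 1, 4))
    alpha[0, 0] = 1.0                        # start from state 0
    for i in range(k):
        for s_prev, d, x, y, s_next in branches:
            g = np.exp((2*x - 1) * llr_x[i] / 2 + (2*y - 1) * llr_y[i] / 2)
            alpha[i + 1, s_next] += alpha[i, s_prev] * g
        alpha[i + 1] /= alpha[i + 1].sum()   # keep the metrics bounded
    return alpha

rng = np.random.default_rng(0)
alpha = forward_metrics(rng.normal(size=8), rng.normal(size=8), build_trellis())
print(alpha[-1])   # each section is a normalized probability vector
```

The backward (β) recursion of the B modules is the mirror image of this loop, running from section k down to 0.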


[Figure 11.12: Analog circular APP decoder. Each section receives LLR(X_i), LLR(Y_i) and, for turbo decoding, an input extrinsic LLR; a Γ module feeds the A and B modules, which exchange α and β metrics with the adjacent sections; an Extr module outputs LLR_ext(X_i) (turbo only) and a Dec module delivers the hard decision d_i. There are as many sections as information bits to decode; each section is made up of modules related to the APP algorithm operations.]

11.7.2  The Next Step: The Analog Turbo Decoder

As a stand-alone decoder of a simple convolutional code, the APP algorithm does not offer outstanding performance. However, once used in a turbo architecture (Berrou et al., 1993), the APP algorithm reaches its full potential. As shown in fig. 11.14, the two APP decoders exchange information (the so-called extrinsic information), through interleavers, on the reliability of the received data. In a digital version of the turbo decoder, these exchanges are clocked, and the decoding process is repeated as many times as necessary to reach a solution. The complexity and the latency of the decoding process are proportional to the number of iterations, which can be a drawback for some applications.

As can easily be seen, the turbo architecture is well suited to analog implementation since it is a simple feedback system that does not require any internal clocking. In the analog version, the exchanges of extrinsic information are continuous in time. This property, associated with the high degree of parallelism mentioned previously, leads to potential throughputs of several Gbit/s, together with low complexity and latency. To give an example, the architecture of a complete DVB-RCS turbo decoder was simulated using behavioral models. The decoding of the smallest frame of this standard (48 double-binary symbols and rate 1/2) confirms the high capacity of the analog turbo decoder (Arzel et al., 2004). Besides the gains in throughput,


[Figure 11.13: Transistor-level design of the Γ and A modules. The Γ module computes the branch metrics γ00, γ01, γ10, γ11 from LLR(X) and LLR(Y); the A module combines them with the forward metrics α_i(0), ..., α_i(3) to produce α_{i+1}(0), ..., α_{i+1}(3).]

[Figure 11.14: The analog turbo decoder. Two analog APP decoders, fed with LLR(X), LLR(Y1), and LLR(Y2), exchange the extrinsic information LLR_ext(X_i1) and LLR_ext(X_i2) through the permutations P and P^-1.]


complexity, and latency, fig. 11.15 illustrates that analog decoding also yields a gain in performance (about 0.1 dB for a BER of 10^-4) compared to the digital version. The equivalent digital circuit uses floating-point number representation and runs for 15 iterations, which provides maximum iterative performance. The analog decoder performs better because it benefits from continuous time: there is no iteration but continuous sharing of extrinsic information.

[Figure 11.15: Bit and frame error rate curves for the DVB-RCS double-binary turbo code, for frames with k = 96 information bits and R = 1/2, as a function of Eb/N0. Analog decoding is compared with digital decoding (15 iterations) and with the BER of uncoded QPSK.]

In conclusion, the ability of the analog decoder to use soft time, in addition to soft input samples, enhances the error correction of turbo codes while increasing data rates and reducing complexity and latency.

11.8  Other Applications of the Turbo Principle

For the sake of clarity we assume a point-to-point transmission with only one transmitter and one receiver. However, this discussion can be extended to more complex situations such as wireless broadcast, with point-to-multipoint transmission. This communication system involves different tasks. The aim of each task can be different and sometimes contradictory, but the system must globally address the following problem: transmit digital information over the propagation channel with maximum rate and minimum error probability. Actually, the heart of this


problem is the propagation channel, because its physical nature always leads to a limited available frequency bandwidth and a non-ideal frequency response. Consequently, the channel corrupts the transmitted signal by introducing distortion, noise disturbances, and other interference.

The general block diagram shown in fig. 11.16 depicts the basic elements of a digital communications system. The source encoder converts the original message into a sequence with as few binary bits as possible, in order to save bandwidth and to optimize the transmission rate. Unlike the source encoder, the channel encoder adds redundancy to the compressed message to form codewords. As detailed in the previous sections, the purpose of this function is to increase the reliability of the transmitted data, which will be distorted and impaired by the channel. Finally, the digital modulator converts the codewords into waveforms that are compatible with the channel. At the receiver, the demodulator converts the received waveforms into an analog-like sequence. This sequence is then fed to the decoder, which attempts to reconstruct the compressed message from the constraints of the code. Finally, the source decoder reconstructs the original message from the knowledge of the source encoding algorithm.

The above description is in fact a simplified version of a digital communication system, and several other functions have been omitted. The functions in a receiver can be classified into three categories: estimation, detection, and decoding (fig. 11.17). By estimation, we mean the estimation of certain channel parameters. Carrier

[Figure 11.16: Basic representation of a digital communications system. Transmitted message → source encoder → channel encoder → modulator → channel → demodulator → channel decoder → source decoder → received message.]

[Figure 11.17: A more detailed receiver. The demodulator output feeds detection and estimation, which share channel information, followed by the channel decoder and the source decoder.]


phase recovery, frequency offset estimation, timing recovery, and channel impulse response estimation belong to this category. Detection involves all processing of interference introduced by the channel, such as intersymbol interference, multiple-access interference, cochannel interference, and multi-user interference. Whatever the type of interference, the same mathematical models can be exploited and algorithms based on identical structures and criteria can be derived. The data are finally recovered by channel decoding.

In this conventional representation, each processing step is performed separately. Detection and estimation modules ignore channel and source coding and work as if the received symbols were mutually independent. Information theory tells us that the maximum likelihood (ML) criterion should be applied globally over the whole receiver in order to minimize the error probability, which is the final target. Unfortunately, with so many factors to take into account, the complexity becomes prohibitive and the problem totally intractable. Consequently, the optimal ML approach is never implemented in practice, and system engineers prefer the conventional receiver of fig. 11.17 despite its suboptimality.

Nevertheless, the loss of performance due to this suboptimality is kept as low as possible thanks to the use of soft information. As already mentioned at the beginning of this chapter, the process of making hard decisions discards a part (i.e., the magnitude, as an image of the reliability) of the inherent information present in the received data. For instance, instead of making hard decisions, it is better for the detector to deliver soft information to the decoder.

11.8.1  Feedback Process: The Answer to Suboptimality

Using soft information in the conventional receiver is not suﬃcient to reach the same performance as that of the optimal ML receiver. The loss of performance essentially comes from the fact that the tasks at the front of the receiver, typically detection and estimation, do not beneﬁt from the work of the channel decoder. In a turbo decoder, this problem has been solved by introducing a feedback loop between the component decoders. This loop enables a bidirectional exchange of soft information between these decoders. Each decoder beneﬁts from the work done by the other. The turbo principle can also be applied to solve the suboptimality issue of the conventional receiver. This approach was originally proposed by Douillard et al. (1995) to jointly process equalization and decoding. In a turbo receiver, an iterative process replaces the sequential process of the conventional receiver of ﬁg. 11.17. The received data are now processed several times by the detector, the estimator, and the decoder. Each pass through the estimation/detection/decoding scheme is called an iteration as in a turbo decoder. The iterative receiver is illustrated in ﬁg. 11.18. At the ﬁrst iteration, the detection and the estimation functions do not beneﬁt from any a priori information about the transmitted data. The soft information provided to the decoder is then the same as in the conventional sequential receiver. In turn, the decoder provides its own soft information for estimation/detection, creating a feedback loop. From


the second iteration, this feedback information acts as a priori information for estimation/detection and improves their results. Figure 11.18 also indicates a turbo process between the channel decoder and the source decoder. In this situation, constraints inherent to the source encoder may help the channel decoder in its work.
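The extrinsic-information exchange at the heart of such a receiver can be illustrated with a deliberately simple numerical toy (our own construction, not any particular standardized receiver): two stages observe the same BPSK symbols through independent noise and pass only extrinsic LLRs, so that no stage feeds its own information back to itself.

```python
import numpy as np

rng = np.random.default_rng(1)
bits = rng.integers(0, 2, size=1000)
s = 2 * bits - 1                       # BPSK symbols
sigma2 = 1.0
# Two independent noisy observations of the same symbols, standing in
# for two processing stages (e.g., detector and decoder views).
y1 = s + rng.normal(scale=np.sqrt(sigma2), size=s.size)
y2 = s + rng.normal(scale=np.sqrt(sigma2), size=s.size)

Lch1, Lch2 = 2 * y1 / sigma2, 2 * y2 / sigma2   # channel LLRs

def process(l_channel, l_apriori):
    """A stage combines its own channel LLR with the a priori LLR it
    received, and returns (posterior, extrinsic)."""
    post = l_channel + l_apriori
    return post, post - l_apriori      # extrinsic: remove what came in

# Iteration 1: stage 1 works without a priori information; stage 2
# uses stage 1's extrinsic output as its a priori input.
_,  e1 = process(Lch1, np.zeros_like(Lch1))
p2, e2 = process(Lch2, e1)
# Feeding e2 back to stage 1 closes the loop, as in fig. 11.18.
p1, _ = process(Lch1, e2)

# Because only extrinsic values circulate, the loop settles on the
# optimal combination Lch1 + Lch2 instead of double-counting.
print(np.allclose(p1, Lch1 + Lch2))   # True
```

Real detectors and decoders are of course nonlinear, which is why several iterations, and the permutations discussed below, are needed in practice.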

[Figure 11.18: Illustration of a simplified turbo receiver. Detection and estimation, fed from the demodulator and the channel information, exchange soft information with the channel decoder, and the channel decoder with the source decoder, in a feedback loop. The permutation functions are not represented.]

Thus a turbo receiver involves the repetition of information exchanges a number of times, the channel decoder playing a central role thanks to the amount of redundancy it benefits from. If the SNR is not too bad, the turbo receiver produces new and better estimates at each iteration. Nevertheless, this performance improvement is bounded by the ML performance, which cannot be surpassed.

In section 11.5, the role and the importance of the permutation function were emphasized. In the turbo receiver, permutations (which are not represented in fig. 11.18) also play a crucial role by minimizing the correlation effects in consecutive pieces of information exchanged by the different processors. In particular, an appropriate permutation located after the decoder avoids the consequences of possible residual error bursts.

The implementation of the turbo receiver assumes that all functions are able to accept and deliver soft extrinsic information. In particular, the APP algorithm is well suited to the decoder, especially if it is a turbo decoder.

11.8.2  Probabilistic Detection and Estimation Algorithms

Actually, before the invention of turbo coding and decoding, probabilistic algorithms with a priori probabilities for conventional detection or estimation tasks were not needed, and so they were not known! Consequently, before implementing an iterative process between estimation/detection and decoding tasks, probabilistic algorithms for estimation and detection have to be elaborated. In fact, the APP algorithm suits the problem of detection well, whereas the expectation-maximization (EM) algorithm (Moon, 1996), also a probabilistic algorithm, is better adapted to the problem of estimation. Many iterative digital receivers are based on these two algorithms, whose major advantage is optimality. Furthermore, iterative exchanges


between detection, or estimation, and decoding can be implemented directly thanks to the a priori probabilities. However, these algorithms are computationally demanding, and the resulting complexity of the receiver may become prohibitive. In this situation, suboptimal alternatives are needed.
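As a toy illustration of such probabilistic estimation (a minimal sketch of our own, not one of the receiver algorithms cited above), the EM algorithm can estimate a channel gain from BPSK observations with unknown symbols; a priori LLRs coming from a decoder would simply enter the E-step.

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, sigma2, n = 0.8, 0.25, 4000
s = 2 * rng.integers(0, 2, size=n) - 1        # unknown BPSK symbols
y = a_true * s + rng.normal(scale=np.sqrt(sigma2), size=n)

def em_gain(y, sigma2, n_iter=20, llr_apriori=0.0):
    """EM estimate of the gain a in y = a*s + noise, with s in {-1, +1}.
    E-step: posterior mean of s given y and the current gain (a priori
    LLRs from a decoder, if any, add into the tanh argument).
    M-step: refit a using the soft symbols (s^2 = 1)."""
    a = np.mean(np.abs(y))                    # crude initialization
    for _ in range(n_iter):
        s_soft = np.tanh(a * y / sigma2 + llr_apriori / 2)   # E-step
        a = np.mean(y * s_soft)                              # M-step
    return a

a_hat = em_gain(y, sigma2)
print(round(a_hat, 2))   # close to the true gain 0.8
```

Each EM iteration is itself a soft-information update, which is what makes this family of estimators compatible with the turbo exchanges described above.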

11.8.3  Suboptimal Algorithms

The principles of these alternatives were already known for conventional detection and estimation. They are suboptimal because they are based on criteria other than minimizing the error probability, such as minimizing the mean square error (minimum mean square error, MMSE) or maximizing the SNR. But these suboptimal algorithms also lead to less complex implementations, often in the form of filter-based structures. As these structures were traditionally used in the conventional receiver of fig. 11.17, no a priori input was required. In order to exchange information with the channel decoder, detection and estimation algorithms have to take a priori probabilities into account and also deliver probabilities.

Two solutions have been considered in the literature to implement soft-in/soft-out detectors and estimators. The first one involves modifying or adapting preexisting techniques. The second one is to derive totally new architectures.

Let us consider the first approach. The problem can be formulated as follows: how can the a priori input sequence be utilized in conventional estimation/detection algorithms, which so far have only processed the received data? In fact, several estimation or detection algorithms already use feedback of estimated data, for example, the DFE (decision feedback equalizer) for combating intersymbol interference (Proakis, 2000) and the SIC (successive interference cancellation) receiver for multi-user transmission (see Dai and Poor, 2002, for instance). An illustration of these locked-up schemes is given in fig. 11.19. The loop generally involves hard decisions on symbols, provided by a hard slicer, that are utilized to improve the estimation/detection result. The derivation of the algorithms is generally based on the assumption that the hard decisions fed back to the detection/estimation algorithm correspond exactly to the transmitted symbols. In a turbo process, the decoder

[Figure 11.19: Illustration of a conventional locked-up estimation/detection scheme. Detection and estimation, fed from the demodulator and the channel information, are improved by a feedback loop of hard decisions.]

can replace the hard slicer and provide improved feedback information. The detection or estimation algorithms are not modified and are still based on the assumption of perfect feedback. As the channel decoder provides probabilities and the detection/estimation algorithms expect symbol estimates, a mapper is needed between the channel decoder and the estimation/detection algorithms. This mapper generates an estimate of the transmitted symbols from the a posteriori or extrinsic probabilities provided by the decoder. This estimate is then directly used as feedback into the suboptimal detection and estimation algorithms, as explained previously.

However, more potential gain is available if the assumption of perfect feedback is dropped in the derivation of the algorithms. Indeed, the assumption of perfect feedback in a turbo process is not valid, since the data estimates are improved iteration after iteration. Perfect feedback is achieved only when the convergence of the turbo process is attained. Consequently, retaining the assumption of perfect feedback in the derivation of the detection/estimation algorithms, as in the first approach, generally leads to a nonoptimal solution. A new derivation of the algorithms becomes necessary. The derivation is still based on the initial suboptimal criteria, MMSE for example. But, in addition, feedback decoder probabilities, replacing the perfect feedback assumption, are introduced into the derivation. Very powerful algorithms are thus derived, generally different from conventional algorithms and often based on an interference canceller structure. Furthermore, they offer very significant complexity advantages over probabilistic algorithms.

Iterative schemes based on well-designed suboptimal algorithms achieve very good performance and can potentially reach exactly the same performance bound as iterative schemes based on probabilistic algorithms. The major difference is the convergence speed: more iterations are sometimes necessary to achieve the performance bound.

In conclusion, the turbo principle has found numerous applications in communication receivers. It has proved its capacity to achieve performance very close to that of the ML receiver, while requiring significantly reduced complexity.

11.8.4  Further References

Because the literature in the domain of turbo receivers is huge, we will restrict the references to pioneering or overview papers. As already mentioned, the turbo principle was extended for the first time to the problem of joint equalization and decoding in 1995 (Douillard et al., 1995). This new turbo receiver, called a turbo detector, was based on an APP equalizer and an APP channel decoder, and demonstrated quasi-optimal performance. The second successful attempt concerned coded modulation. Robertson and Wörz (1996) introduced an efficient coding scheme for high-order modulations based on the parallel concatenation of Ungerboeck codes (Ungerboeck, 1982). This technique, named turbo trellis coded modulation (TTCM), provides good performance on AWGN channels.


Later, Glavieux et al. (1997) proposed a low-complexity version of a turbo equalizer based on adaptive MMSE filters. For the first time, a nonprobabilistic algorithm was introduced into a turbo receiver. Here, a priori information is taken into account in the filter coefficient computation thanks to an adaptive algorithm such as the least mean square (LMS) algorithm. At the same time, Hagenauer introduced the "turbo" principle, extending this concept to several tasks in a communication system: joint source and channel decoding, coded modulation, and multi-user detection (Hagenauer, 1997b).

Following these pioneering works, a huge number of papers were then devoted to the turbo principle. For instance, a bit-interleaved coded modulation (BICM) turbo decoder for ergodic channels was proposed independently by ten Brink et al. (1998) and Chindapol et al. (1999). This technique, called iterative demapping or BICM-ID (iterative decoder), is based on a feedback loop between the demapper, which computes LLRs, and the channel decoder. The a priori information provided by the decoder is used in the soft demapper to remove the assumption of independent and identically distributed coded bits. This technique needs non-Gray mapping in order to provide performance gains.

In the domain of synchronization, Langlais and Hélard (2000) have addressed the problem of carrier phase recovery for turbo decoding at low SNRs, by using tentative decisions from the first component decoder in the carrier phase recovery loop. As this system does not exploit the iterative structure of the turbo decoder, Lottici and Luise (2002) then developed a carrier phase recovery embedded in turbo decoding. Recently, Barry et al. (2004) have proposed an overview of methods for implementing timing recovery with turbo decoding. In the past few years many significant developments have arisen in the field of iterative multi-user detection; a number of references can be found in the paper by Poor (2004).
Finally, the turbo principle can also be applied to multiple-antenna detection, as in MIMO (multi-input multi-output) systems. The famous BLAST system from Bell Labs has thus been transposed into a turbo receiver, in which the multi-antenna detector exchanges information with the channel decoder (Haykin et al., 2004). Optimal and suboptimal multiple-antenna detectors are also presented there, leading to various implementation complexities.

12 Blind Signal Processing Based on Data Geometric Properties

Konstantinos Diamantaras

12.1 Introduction

Blind signal processing deals with the outputs of unknown systems excited by unknown inputs. At first sight the problem seems intractable, but a closer look reveals that certain signal properties allow us to extract the inputs or to identify the system up to some, usually unimportant, ambiguities. Linear systems are mathematically the most tractable and, naturally, they have attracted most of the attention. Depending on the type of the linear system, blind problems arise in a wide variety of applications, for example, in digital communications (Diamantaras and Papadimitriou, 2004a,b; Diamantaras et al., 2000; Godard; Papadias and Paulraj, 1997; Paulraj and Papadias, 1997; Shalvi and Weinstein, 1990; Talwar et al., 1994; Tong et al., 1994; Torlak and Xu, 1997; Treichler and Agee, 1983; Tsatsanis and Giannakis, 1997; van der Veen and Paulraj, 1996; van der Veen et al., 1995; Yellin and Weinstein, 1996), in biomedical signal processing (Choi et al., 2000; Cichocki et al., 1999; Jung et al., 1998; Makeig et al., 1995, 1997; McKeown et al., 1998; Vigário et al., 2000), in acoustics and speech processing (Douglas and Sun, 2003; Parra and Spence, 2000; Parra and Alvino, 2002; Shamsunder and Giannakis, 1997), etc. Many recent books on the subject (Cichocki and Amari, 2002; Haykin, 2001a,b; Hyvärinen et al., 2001) provide extensive discussion of related problems and methods. The most general finite, linear, time-invariant (LTI) system is expressed by a multichannel convolution of length L, operating on a discrete vector signal s(k) = [s_1(k), ..., s_n(k)]^T:

x(k) = \sum_{i=0}^{L-1} H_i s(k - i).   (12.1)

The FIR filter taps H_i are, in general, complex matrices of size m × n, m ≥ 1. Thus, the output is an m-dimensional complex vector x(k). For n, m > 1, equation 12.1 describes a linear, discrete, multi-input multi-output (MIMO) system.
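Equation 12.1 is easy to simulate directly. The sketch below (the function name and numeric values are illustrative, not from the text) applies a length-2, 2×2 MIMO FIR system to a short binary vector signal:

```python
def mimo_convolve(H, s):
    """Apply the MIMO FIR system of equation 12.1: x(k) = sum_i H_i s(k - i).

    H: list of L taps, each an m-by-n matrix (list of m rows of n numbers).
    s: list of K input vectors (each of length n); s(k) = 0 for k < 0.
    Returns the list of K output vectors x(k), each of length m.
    """
    L = len(H)
    m = len(H[0])
    x = []
    for k in range(len(s)):
        xk = [0.0] * m
        for i in range(L):
            if k - i >= 0:  # causal: taps only see past and present inputs
                for r in range(m):
                    xk[r] += sum(H[i][r][c] * s[k - i][c]
                                 for c in range(len(s[k - i])))
        x.append(xk)
    return x

# Example: L = 2 taps, m = n = 2 (a two-input/two-output system).
H = [[[1.0, 0.5], [0.2, 1.0]],   # H_0
     [[0.3, 0.0], [0.1, 0.4]]]   # H_1
s = [[1, -1], [1, 1], [-1, 1]]
x = mimo_convolve(H, s)
# x(0) = H_0 s(0); x(1) = H_0 s(1) + H_1 s(0); and so on.
```

With L = 1 and m, n > 1 the same routine reduces to the instantaneous mixture of equation 12.2; with m = n = 1 it reduces to the SISO filter of equation 12.3.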


12.1.1 Types of Mixing Systems

We shall study two special cases of system 12.1, sharing many similarities but also having some special characteristics, as described below.

Instantaneous Mixtures: In this case we have more than one source and more than one observation, i.e., m, n > 1, but there is no convolution involved, so L = 1. The output vector is produced by a linear, instantaneous transformation:

x(k) = H s(k).   (12.2)

This type of system is also called memoryless.

Single-Input Single-Output (SISO) Convolution: In this case we have exactly one source and one observation, so m = n = 1, but the convolution is nontrivial, i.e., L > 1:

x(k) = \sum_{i=0}^{L-1} h_i s(k - i).   (12.3)

Equation 12.3 describes a linear, SISO FIR filter.

12.1.2 Types of Blind Problems

Regardless of the specific system type, there are two kinds of blind problems of interest here, depending on whether we wish to extract the input signals or the system parameters.

Blind Source Extraction: In this type of problem our goal is to recover the source(s) given the observation signal x(k) (scalar or vector). If there is more than one source, the problem is called blind source separation (BSS). In the case of BSS, the linear system may be either instantaneous or convolutive (general MIMO). In the case of blind deconvolution (BD), we want to invert a linear filter which, of course, operates on its input via the convolution operator, hence the name deconvolution. The problem is very important, for example, in wireless communications, where n transmitted signals corrupted by intersymbol interference (ISI), multi-user interference (MUI), and noise are received at m antennas. The source separation/extraction problem has an inherent ambiguity in the order and the scale of the sources: the original signals cannot be retrieved in their original order or scale unless some further information is available. For example, if the source samples (symbols) are drawn from a known finite alphabet, then there is no ambiguity in the scale. If, however, the alphabet is symmetric with respect to zero, then there exists a sign ambiguity, since both signals s(k) and −s(k) are plausible. Furthermore, the ordering ambiguity is always present if the problem involves more than one source.


Blind System Identification: In this type of problem our goal is to obtain the system parameters rather than to recover the source signals. If the system is memoryless, the goal is to recover the mixing matrix H. If the system involves nontrivial convolution, the goal is to extract the filter taps h_0, ..., h_{L-1} or H_0, ..., H_{L-1}.

12.1.3 Approaches to Blind Signal Processing

Typically, blind problems are approached either by using statistical properties of the signals involved or by exploiting the geometric structure of the data constellation, as described next.

Higher-Order Methods: According to the central limit theorem, the system output, being a sum of many input samples, approaches the Gaussian distribution irrespective of the input distribution. A characteristic property of the Gaussian distribution is that all higher-order cumulants (for instance, the kurtosis) are zero. If the inputs are not normally distributed, their higher-order cumulants will be nonzero, and so equation 12.1 works as a "cumulant reducer." Clearly, the blind system inversion (the linear transform that recovers the sources from the output) should function as a "cumulant increaser"; i.e., it should maximize the absolute cumulant value for a given signal power. This is, in fact, the basic idea behind all higher-order methods.

Second-Order Methods: Alternatively, second-order methods can be applied when the sources have colored spectra, regardless of their distribution. If the source colors are not identical, then the time-delayed covariance matrices have a certain eigenvalue structure which reveals the mixing operator in the memoryless case. This information can be used for recovering the sources as well. In the dynamic case things are more complicated, although, again, second-order methods have been proposed based on statistics in either the frequency or the time domain.

A Third Approach, Exploiting the Signal Geometry: Neither higher-order nor second-order methods exploit the cluster structure or shape of the input data when such structure exists. Consider, for example, a source signal s(k) whose samples are drawn from a finite alphabet A_M = {±1, ..., ±(M/2)} (M even). Let the SISO FIR filter described in equation 12.3 be excited by s(k).
Writing N equations (N ≥ L) of the form of equation 12.3 for N consecutive values of k, we obtain the following matrix equation:

\begin{bmatrix} x(k) \\ x(k+1) \\ \vdots \\ x(k+N-1) \end{bmatrix}
=
\begin{bmatrix}
s(k) & s(k-1) & \cdots & s(k+1-L) \\
s(k+1) & s(k) & \cdots & s(k+2-L) \\
\vdots & \vdots & \ddots & \vdots \\
s(k+N-1) & s(k+N-2) & \cdots & s(k+N-L)
\end{bmatrix}
\begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{L-1} \end{bmatrix}   (12.4)

x = S h.   (12.5)


where the N × L Toeplitz matrix S involves (N + L − 1) unknown input symbols. It is possible, in principle, to identify h in a deterministic way by an exhaustive search over all M^{N+L-1} possible matrices S such that \min_h \|x - Sh\|^2 = 0. Although such a search is highly impractical, this observation tells us that there is more to blind signal processing than statistical processing. If the sources, for example, have a certain structure which produces clusters in the data cloud, or if the input distribution is bounded (e.g., uniform), then one can exploit the geometric properties of the output constellation and derive fast and efficient deterministic algorithms for blind signal processing. These methods are treated in this chapter. In particular, section 12.2 discusses blind methods for systems with finite alphabet sources. The discussion covers both instantaneous and convolutive mixtures and is based on the geometric properties of the data cloud. Section 12.3 discusses the case of continuous-valued sources that are either sparse or have a specific input distribution, for example, uniform. Our discussion of continuous sources covers only the case of instantaneous systems. Certainly, there is a lot of room for innovation along this line of research, since many issues remain open today.

12.2 Finite Alphabet Sources

Blind problems involving sources with finite alphabets (FA) have drawn a lot of attention, because such signals are common in digital communications. Popular modulation schemes, for instance quadrature amplitude modulation (QAM), pulse amplitude modulation (PAM), and binary phase-shift keying (BPSK), produce signals with limited numbers of symbols. A large body of literature exists on the instantaneous mixture problem, not only because it is the simplest one but also because most methods dealing with the more realistic convolutive mixture problem lead to the solution of an instantaneous problem. In Anand et al. (1995), the blind separation of binary sources from instantaneous mixtures is approached using separate clustering and bit-assignment algorithms. An extension of this method is presented in Kannan and Reddy (1997), where a maximum likelihood (ML) estimate of the cluster centers is provided. Talwar et al. (1996) presented two iterative least-squares methods for the BSS of binary sources: ILSP (iterative least squares with projection) and ILSE (iterative least squares with enumeration). The same problem is treated in van der Veen (1997), where the real analytical constant modulus algorithm (RACMA) is introduced, based on the singular value decomposition (SVD) of the observation matrix. In Pajunen (1997), an iterative algorithm is proposed for the blind separation of more binary sources than sensors. Finite alphabet sources and instantaneous mixtures are also discussed in Belouchrani and Cardoso (1994), where an ML approach is proposed using the EM algorithm. Grellier and Comon (1998) introduce a polynomial criterion and a related minimization algorithm to separate FA sources. In all the above methods, the geometric properties of the data cloud are not explicitly used. Geometrical concepts, such as the relative distances between the cluster centers, were introduced in Diamantaras (2000) and Diamantaras


and Chassioti (2000). It turns out that just one observation signal is sufficient for blindly separating n binary sources in the noise-free case, under mild assumptions. A similar algorithm based on geometric concepts was later proposed in Li et al. (2003). In this section we shall study the geometric structure of data constellations generated by linear systems operating on signals with finite alphabets. We shall find that the geometry of the resulting data cloud contains information pertaining to the generating linear operator. This information can be exploited either for the blind extraction of the system parameters or for the blind retrieval of the original sources.

12.2.1 Instantaneous Mixtures of Binary Sources

The simplest alphabet is the two-element set, or binary alphabet, A_a = {−1, 1}. We shall assume that the samples of some source signals are drawn from A_a; such signals will be called binary antipodal or, simply, binary. In digital communications, the carrier modulation scheme using symbols from A_a is called binary phase-shift keying (BPSK). The reader is encouraged to verify that our results can easily be generalized to any type of binary alphabet, for example, the nonsymmetric set A_b = {0, 1}. In this subsection we shall concentrate on the first problem type, i.e., on linear memoryless mixtures of many sources, n > 1. Depending on the number of output signals (observations) m, we treat three distinct cases: m = 1, m = 2, and m > 2.

12.2.1.1 A Single Mixture

The instantaneous mixture of n sources linearly combined into a single observation is described by the following equation:

x(k) = \sum_{i=1}^{n} h_i s_i(k) = h^T s(k),   (12.6)

where h = [h_1, ..., h_n]^T and s(k) = [s_1(k), ..., s_n(k)]^T.

We assume that the mixing coeﬃcients hi are real and that si (k) ∈ Aa . If the coeﬃcients are complex, then the problem corresponds to the case m = 2, which is treated later. We start by studying the noise-free system since our primary interest is to investigate the structural properties of the signals and not to develop methods to combat the noise. Of course, eventually, the development of a viable algorithm will have to deal with the noise issue.


Equation 12.6 can be seen as the projection \tilde{x}(k) of s(k) along the direction of the normal vector \tilde{h}, scaled by \|h\|:

x(k) = \|h\| \tilde{x}(k),   (12.7)
\tilde{x}(k) = \tilde{h}^T s(k),   (12.8)
\tilde{h} = h / \|h\|.   (12.9)

The set of values of x(k) will be called the constellation of x(k), denoted by X. It is a set of (at most) 2^n points in the one-dimensional space R. To facilitate our understanding of the geometric structure of X, let us start by assuming that there are only n = 2 sources. There are then four possible realizations of the vector s(k), which form the source constellation S = {s_{--}, s_{-+}, s_{+-}, s_{++}}, where s_{--} = [−1, −1]^T, s_{-+} = [−1, 1]^T, s_{+-} = [1, −1]^T, and s_{++} = [1, 1]^T. Consequently, the output constellation X also consists of four distinct values:

x_{--} = \|h\| \tilde{x}_{--} = h^T s_{--},
x_{-+} = \|h\| \tilde{x}_{-+} = h^T s_{-+},
x_{+-} = \|h\| \tilde{x}_{+-} = h^T s_{+-},
x_{++} = \|h\| \tilde{x}_{++} = h^T s_{++}.

Figure 12.1 shows the projections \tilde{x}_{--}, \tilde{x}_{-+}, \tilde{x}_{+-}, \tilde{x}_{++} of the source constellation S for four different normal mixing vectors \tilde{h}. The relative distance between the points on the projection line is clearly a function of the angle θ between the projection line and the horizontal axis. The problem involves a lot of symmetry. In particular, it is straightforward to verify that we obtain the same output constellation X for the angles ±θ, ±(π/2 − θ), ±(π − θ), and ±(3π/2 − θ), for any θ. This multiple symmetry is the result of the interchangeability of the two sources, s_1 and s_2, as well as the invariance of the source constellation to sign changes. These ambiguities are, however, acceptable, since it is not possible to recover the original source order or the original source signs. Both the source order and the sign are unobservable, as is evident from the following relations:

x(k) = [±h_1, ±h_2] [±s_1(k), ±s_2(k)]^T = [±h_2, ±h_1] [±s_2(k), ±s_1(k)]^T.

Therefore, let us assume, without loss of generality, that the mixing vector h satisfies the constraint

h_1 > h_2 > 0.   (12.10)

Under this assumption, the elements of X are ordered:

x_{--} = −h_1 − h_2 < x_{-+} = −h_1 + h_2 < x_{+-} = +h_1 − h_2 < x_{++} = +h_1 + h_2.   (12.11)

Figure 12.1  The source constellation (circles) of two independent binary sources is projected on four different directions. The relative distances of the projection points (marked by squares) are clearly a function of the slope of the projection line.

Indeed, the first and third inequalities in 12.11 are obvious, since the mixing coefficients are positive. The second inequality is also true, since x_{+-} − x_{-+} = 2(h_1 − h_2) > 0. Thus, by clustering the (observable) output sequence {x(1), x(2), x(3), ...}, we obtain four cluster points c_1, c_2, c_3, c_4, which can be arranged in increasing order and set into one-to-one correspondence with the elements of X:

c_1 = x_{--} < c_2 = x_{-+} < c_3 = x_{+-} < c_4 = x_{++},   (12.12)
c_1 = −c_4,  c_2 = −c_3.

Then, using equation 12.11, we can recover the mixing parameters:

h_1 = (c_3 − c_1)/2,   (12.13)
h_2 = (c_2 − c_1)/2.   (12.14)


Example 12.1  Figure 12.2 shows the position of the cluster points c_1, ..., c_4 for the random mixing vector h = [0.9659, 0.2588]^T. According to equations 12.11 and 12.12, these cluster points are

c_1 = x_{--} = −1.2247,
c_2 = x_{-+} = −0.7071,
c_3 = x_{+-} = 0.7071,
c_4 = x_{++} = 1.2247.

By computing the distances between the pairs (c_3, c_1) and (c_2, c_1), we obtain the unknown mixing parameters directly:

(c_3 − c_1)/2 = 0.9659 = h_1,
(c_2 − c_1)/2 = 0.2588 = h_2.

Figure 12.2  The distances c_3 − c_1 and c_2 − c_1 between the cluster points are equal to twice the unknown mixing parameters.

If our aim is to identify the mixing parameters h_1, h_2, then equations 12.13 and 12.14 have achieved our goal. If, in addition, we want to extract the hidden sources, we may estimate each input sample s(k) separately by finding the binary vector b = [b_1, b_2]^T ∈ A_a^2 such that h^T b best approximates x(k). This corresponds to the following binary optimization problem:

\hat{s}(k) = \arg\min_{b \in A_a^2} |x(k) − h^T b|,  for all k.   (12.15)

Luckily, the above optimization problems are decoupled for different k, and therefore the solution is trivial.
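Equations 12.12 through 12.15 translate into a short separation routine for the noise-free, two-source case. The sketch below reuses the mixing vector of example 12.1, but the routine itself (names and structure) is only an illustration:

```python
from itertools import product

def separate_two_binary(x):
    """Blind identification and extraction for x(k) = h1*s1(k) + h2*s2(k),
    with s_i(k) in {-1, +1}, noise-free, assuming all four symbol pairs occur
    and (up to the inherent sign/order ambiguity) h1 > h2 > 0."""
    c = sorted(set(x))                 # the cluster points c1 < c2 < c3 < c4
    assert len(c) == 4
    h1 = (c[2] - c[0]) / 2             # equation 12.13
    h2 = (c[1] - c[0]) / 2             # equation 12.14
    # Equation 12.15: per-sample nearest-constellation-point decision.
    sources = []
    for xk in x:
        b = min(product((-1, 1), repeat=2),
                key=lambda b: abs(xk - (h1 * b[0] + h2 * b[1])))
        sources.append(b)
    return h1, h2, sources

# Mixing vector from example 12.1: h = [0.9659, 0.2588]^T.
s = [(1, 1), (-1, 1), (1, -1), (-1, -1), (1, 1), (-1, 1)]
x = [0.9659 * a + 0.2588 * b for a, b in s]
h1, h2, s_hat = separate_two_binary(x)
# h1, h2 recover 0.9659 and 0.2588; s_hat recovers the symbol pairs.
```

Because the per-sample decisions are decoupled, the extraction step is a simple nearest-point lookup over the four constellation values.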


The whole idea can be extended to more than two sources using recursive system deflation. This process iteratively identifies and removes the two smallest mixing parameters, eventually reducing the problem either to the two-input case, which is solved as above, or to the single-input case, which is trivial. Our linear mixture model is again the one described in 12.6, with n > 2 and some real mixing vector h = [h_1, ..., h_n]^T. As before, without loss of generality, we shall assume that the mixing parameters are positive and arranged in decreasing order:

h_1 > h_2 > ... > h_n > 0.   (12.16)

We have already shown that for n = 2 the centers c_i are arranged in increasing order. For n > 2, things are a bit more complicated. Let us define B^{(n)} to be the 2^n × n matrix whose ith row, b_i^{(n)T}, is the binary representation of the number (i − 1) ∈ {0, ..., 2^n − 1}:

B^{(n)} = \begin{bmatrix}
-1 & -1 & \cdots & -1 & -1 \\
-1 & -1 & \cdots & -1 & 1 \\
-1 & -1 & \cdots & 1 & -1 \\
\vdots & \vdots & & \vdots & \vdots \\
1 & 1 & \cdots & -1 & 1 \\
1 & 1 & \cdots & 1 & -1 \\
1 & 1 & \cdots & 1 & 1
\end{bmatrix}.   (12.17)

Although the sequence {c_1, ..., c_{2^n}} of the centers

c_i = b_i^{(n)T} h = \sum_{j=1}^{n} b_{ij}^{(n)} h_j,  i = 1, ..., 2^n,   (12.18)

is not exactly arranged in increasing (or decreasing) order, there is a lot of structure in the sequence, as summarized by the following facts (Diamantaras and Chassioti, 2000):

The first three centers c_1 < c_2 < c_3 are the three smallest values in the sequence {c_i}. Similarly, the last three centers c_{2^n-2} < c_{2^n-1} < c_{2^n} are the three largest values in the sequence {c_i}.

The sequence c_1, ..., c_{2^n} defined by 12.18 consists of consecutive quadruples, each arranged in increasing order: c_{4i+1} < c_{4i+2} < c_{4i+3} < c_{4i+4}, i = 0, ..., 2^{n-2} − 1.

The smallest element of the ith quadruple is

c_{4i+1} = \left[ \sum_{j=1}^{n-2} b_{4i+1,j}^{(n)} h_j \right] − h_{n-1} − h_n.   (12.19)


The differences

δ_1 = c_{4i+2} − c_{4i+1} = 2h_n,   (12.20)
δ_2 = c_{4i+3} − c_{4i+1} = 2h_{n-1},   (12.21)
δ_3 = c_{4i+4} − c_{4i+1} = 2(h_{n-1} + h_n),   (12.22)

between the members of the ith quadruple are independent of i. Since c_2 = c_1 + 2h_n and c_3 = c_1 + 2h_{n-1}, the two smallest mixing parameters, h_{n-1} and h_n, can be retrieved from the values of the three smallest centers c_1, c_2, and c_3:

h_n = (c_2 − c_1)/2,   (12.23)
h_{n-1} = (c_3 − c_1)/2.   (12.24)
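As a quick numerical check of equations 12.20 through 12.24 (with hypothetical parameter values, not those of the examples in the text), the two smallest parameters and the quadruple spacings can be read directly off the sorted centers:

```python
from itertools import product

# Hypothetical mixing parameters, sorted as in equation 12.16.
h = [1.0, 0.6, 0.35, 0.2]            # h1 > h2 > h3 > h4 > 0
n = len(h)

# The 2^n centers of equation 12.18, sorted in increasing order.
centers = sorted(sum(b[j] * h[j] for j in range(n))
                 for b in product((-1, 1), repeat=n))

# Equations 12.23 and 12.24: the three smallest centers give the two
# smallest mixing parameters.
h_n  = (centers[1] - centers[0]) / 2
h_n1 = (centers[2] - centers[0]) / 2

# Equations 12.20-12.22: the spacings within every quadruple.
d1, d2, d3 = 2 * h_n, 2 * h_n1, 2 * (h_n1 + h_n)
```

Here h_n and h_n1 come out as 0.2 and 0.35, the two smallest entries of the hypothetical h, exactly as the quadruple structure predicts.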

Once we have obtained h_{n-1} and h_n, we can define a new sequence {c'_i} by picking the first element of each quadruple, shifted by the sum (h_{n-1} + h_n), thus obtaining

c'_i = c_{4(i-1)+1} + h_{n-1} + h_n = \sum_{j=1}^{n-2} b_{4(i-1)+1,j}^{(n)} h_j,  i = 1, ..., 2^{n-2}.   (12.25)

Notice, however, that the first n − 2 bits of the [4(i − 1) + 1]-th row of B^{(n)} are exactly the bits of the ith row of B^{(n-2)}. In other words,

b_{4(i-1)+1,j}^{(n)} = b_{ij}^{(n-2)},  j = 1, ..., n − 2,

and therefore

c'_i = b_i^{(n-2)T} h = \sum_{j=1}^{n-2} b_{ij}^{(n-2)} h_j,  i = 1, ..., 2^{n-2}.   (12.26)

Using these facts, the following recursive algorithm is constructed.

Algorithm 12.1: n binary sources, one observation
Step 1: Compute the centers c_i and sort them in increasing order.
Step 2: Compute h_n and h_{n-1} from equations 12.23 and 12.24.
Step 3: Compute the differences δ_i using equations 12.20, 12.21, and 12.22.
Step 4: Remove the set {c_1, c_2, c_3, c_1 + δ_3} from the sequence {c_i}. Set c'_1 = c_1 + h_n + h_{n-1} as the first element of a new sequence {c'_i}.
Step 5: Repeat until all elements have been removed: find the smallest element c_j of the remaining sequence {c_i}; remove the set {c_j, c_j + δ_1, c_j + δ_2, c_j + δ_3} from {c_i}; keep c_j + h_n + h_{n-1} as the next element of the sequence {c'_i}. At the end, the new sequence {c'_i} will be four times shorter than the original {c_i}.
Step 6: Recursively repeat the algorithm for the new sequence {c'_i} and for a new n' = n − 2 to obtain h_{n'} = h_{n-2} and h_{n'-1} = h_{n-3}. Eventually, n' = 2 or n' = 1.

Steps 4 and 5 are the basic recursion, which reduces the problem size from n to n − 2 by replacing the sequence {c_i} with {c'_i}. At step 6, we iteratively obtain the pairs (h_n, h_{n-1}), (h_{n-2}, h_{n-3}), ..., until we reach the case n = 2 or n = 1. The case of n = 2 sources was treated in the previous subsection. The case of n = 1 is trivial, since it involves only one source: the observation is simply a scaled version of the input, x(k) = h_1 s_1(k), so h_1 = |x(k)| (since |s_1(k)| = 1 and h_1 > 0) and s_1(k) = x(k)/h_1.

Example 12.2  Consider the following system with four sources and one observation:

x(k) = −0.4326 s_1(k) + 1.2656 s_2(k) + 0.1553 s_3(k) − 0.2877 s_4(k).

The mixing vector h = [−0.4326, 1.2656, 0.1553, −0.2877] does not satisfy equation 12.16. The algorithm will recover the vector \hat{h} = [1.2656, 0.4326, 0.2877, 0.1553], which does satisfy equation 12.16 and is identical to h except for the permutation and sign changes of its elements.

Step 1: The sorted sequence of centers is c = {

−2.1412, −1.8306, −1.5658, −1.2760, −1.2552, −0.9654, −0.7006, −0.3900, 0.3900, 0.7006, 0.9654, 1.2552, 1.2760, 1.5658, 1.8306, 2.1412 }

Step 2: Using equations 12.23 and 12.24, we compute \hat{h}_3 = 0.2877 and \hat{h}_4 = 0.1553.
Step 3: Using equations 12.20, 12.21, and 12.22, we obtain δ_1 = 0.3106, δ_2 = 0.5754, δ_3 = 0.8860.
Step 4: Remove {c_1, c_2, c_3, c_1 + δ_3} = {−2.1412, −1.8306, −1.5658, −1.2552} from c. Set c'_1 = −1.6982. New sorted sequence:

c = { −1.2760, −0.9654, −0.7006, −0.3900, 0.3900, 0.7006, 0.9654, 1.2552, 1.2760, 1.5658, 1.8306, 2.1412 }.

Step 5: Remove {c_1, c_1 + δ_1, c_1 + δ_2, c_1 + δ_3} = {−1.2760, −0.9654, −0.7006, −0.3900} from c. Set c'_2 = −0.8330. New sorted sequence: c = { 0.3900, 0.7006, 0.9654, 1.2552, 1.2760, 1.5658, 1.8306, 2.1412 }.
Step 5 (repeated): Remove {c_1, c_1 + δ_1, c_1 + δ_2, c_1 + δ_3} = {0.3900, 0.7006, 0.9654, 1.2760}. Set c'_3 = 0.8330. New sorted sequence: c = { 1.2552, 1.5658, 1.8306, 2.1412 }.


Step 5 (repeated): Remove {c_1, c_1 + δ_1, c_1 + δ_2, c_1 + δ_3} = {1.2552, 1.5658, 1.8306, 2.1412}. Set c'_4 = 1.6982. New sorted sequence: c = ∅.
Step 6: The new sequence c' = {−1.6982, −0.8330, 0.8330, 1.6982} yields the estimates of the remaining mixing parameters, \hat{h}_1 = 1.2656 and \hat{h}_2 = 0.4326.

12.2.1.2 Two Mixtures

In the case of m = 2 mixtures, the observed data x(k) lie in the two-dimensional space R^2. Although it is possible to treat each mixture separately as a single-mixture, multiple-source problem, as in the previous subsection, this is not the most efficient approach. It turns out that the 2D structure of the output constellation reveals the mixing operator H in a very elegant and straightforward way. To see this, let us start by considering the data constellation of a binary antipodal signal s_1(k) (fig. 12.3a). The constellation consists of two points on the real axis: s_− = −1 and s_+ = 1. Next, consider a linear transformation from R^1 to R^2 which maps s_1(k) to a vector signal x^{(1)}(k) = [x_1^{(1)}(k), x_2^{(1)}(k)]^T:

x^{(1)}(k) = h_1 s_1(k).   (12.27)

The linear operator h_1 = [h_{11}, h_{12}]^T is a two-dimensional vector, shown in fig. 12.3b. The constellation of x^{(1)}(k), shown in fig. 12.3c, also consists of two points, x_− = −h_1 = s_− h_1 and x_+ = h_1 = s_+ h_1. Now let us look at the shape of the data cloud corresponding to the linear combination of several binary antipodal sources s_1(k), ..., s_n(k). It is instructive to study the shape of this cloud as n increases gradually from n = 2 upward. The linear mixture of n = 2 sources,

x^{(2)}(k) = h_1 s_1(k) + h_2 s_2(k),   (12.28)

has the geometric structure shown in fig. 12.4b for the mixing vectors h_1, h_2 shown in fig. 12.4a. The data cluster contains four points: x_{++} = s_+ h_1 + s_+ h_2, x_{+-} = s_+ h_1 + s_− h_2, x_{-+} = s_− h_1 + s_+ h_2, and x_{--} = s_− h_1 + s_− h_2. Adding a third source s_3(k) with mixing vector h_3, the mixture

x^{(3)}(k) = h_1 s_1(k) + h_2 s_2(k) + h_3 s_3(k)   (12.29)

has the constellation shown in ﬁg. 12.5. Now the data cluster contains eight points: x+++ = s+ h1 + s+ h2 + s+ h3 , x++− = s+ h1 + s+ h2 + s− h3 , x+−+ = s+ h1 + s− h2 + s+ h3 , x+−− = s+ h1 + s− h2 + s− h3 , x−++ = s− h1 + s+ h2 + s+ h3 , x−+− = s− h1 + s+ h2 + s− h3 , x−−+ = s− h1 + s− h2 + s+ h3 , and x−−− = s− h1 + s− h2 + s− h3 . By simple inspection of ﬁgures 12.3, 12.4, and 12.5, one can make the following useful observations:

Figure 12.3  (a) Data constellation of a binary antipodal signal s_1(k). (b) Linear transformation vector h_1. (c) Data constellation of the transformed signal x^{(1)}(k) = h_1 s_1(k).

1. The number of cluster points is 2^n, where n is the number of binary sources.
2. The data constellation is a symmetric, self-repetitive figure. While the symmetry is obvious, the self-repetitive structure can be seen by comparing, for example, fig. 12.5b with fig. 12.4b: the former consists of two copies of the latter, shifted by the vectors −h_3 and h_3. The same relation holds between figs. 12.4b and 12.3c, except that the shift is by the vectors −h_2 and h_2.
3. For every cluster point there exist n copies in the directions ±h_1, ±h_2, ..., ±h_n.

It is even more interesting and, in fact, very useful to study the properties of the convex hull of the data constellation set. By definition, the convex hull of a set of points in 2D space is the smallest polygon that contains them, in other words, the bounding polygon for these points. Figures 12.6a-c show the convex hulls H_1, H_2, and H_3 for the data constellations corresponding to the mixtures x^{(1)}, x^{(2)}, and x^{(3)}, respectively. Let d be the distance between the two alphabet symbols.


Figure 12.4  (a) Mixing vectors h_1, h_2. (b) Data cluster for the mixture x^{(2)}(k) = h_1 s_1(k) + h_2 s_2(k) of two binary antipodal sources.

Figure 12.5  (a) Mixing vectors h_1, h_2, h_3. (b) Data cluster for the mixture x^{(3)}(k) = h_1 s_1(k) + h_2 s_2(k) + h_3 s_3(k) of three binary antipodal sources.

It can be shown that any convex hull H satisfies the following properties (for the proof, see Diamantaras, 2002):

1. Every edge e of H is parallel to some mixing vector h_i, i ∈ {1, ..., n}. Also, e has length d\|h_i\|. For the binary antipodal alphabet A_a, we have d = 2.
2. Every vector h_i corresponds to a pair of edges, i.e., it is parallel to two edges e_i and e'_i of equal length d\|h_i\|. It follows that H has 2n edges.
3. H is symmetric. If the alphabet is symmetric around 0 (e.g., A_a), then the center of symmetry is the point x_O = 0. Otherwise, the center of symmetry is a nonzero point x_O ∈ R^m.


Figure 12.6  Convex hulls for data constellations of mixtures of n binary sources: (a) n = 1, (b) n = 2, (c) n = 3.
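The correspondence between hull edges and mixing vectors established by the properties above can be verified numerically. The sketch below builds the constellation for three hypothetical mixing vectors, computes the 2D convex hull with Andrew's monotone chain (a standard computational-geometry method, not one from the text), and recovers the vectors from the edges up to order and sign:

```python
from itertools import product

def cross(o, a, b):
    # z-component of (a - o) x (b - o): positive for a counterclockwise turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for out, seq in ((lower, pts), (upper, pts[::-1])):
        for p in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
    return lower[:-1] + upper[:-1]

# Hypothetical mixing vectors for n = 3 binary sources and m = 2 observations.
h_true = [(1.0, 0.1), (0.3, 0.8), (-0.2, 0.3)]
constellation = [(sum(b[i] * h_true[i][0] for i in range(3)),
                  sum(b[i] * h_true[i][1] for i in range(3)))
                 for b in product((-1, 1), repeat=3)]

hull = convex_hull(constellation)          # 2n = 6 vertices for n = 3 sources
# Each pair of consecutive hull vertices differs by +/- d*h_i with d = 2.
edges = [((hull[(i + 1) % len(hull)][0] - hull[i][0]) / 2,
          (hull[(i + 1) % len(hull)][1] - hull[i][1]) / 2)
         for i in range(len(hull))]
# Fix the sign ambiguity arbitrarily and collapse each +/- edge pair.
canon = [e if e[0] > 0 else (-e[0], -e[1]) for e in edges]
recovered = sorted(set((round(a, 9), round(b, 9)) for a, b in canon))
# recovered now matches h_true up to order and sign.
```

Of the 2^3 = 8 constellation points, two lie strictly inside the hull; the six boundary points form the hexagon whose edge pairs encode the three mixing vectors.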


These important results show that there is a two-to-one correspondence between the edges of the convex hull and the unknown mixing vectors: there is a pair of edges parallel to each mixing vector, and, furthermore, the edges have length equal to d times the length of their corresponding mixing vectors. Thus, we arrive at the following procedure for identifying the h_i's.

Algorithm 12.2: n binary sources, two observations
Step 1: Find the constellation set X of the 2D mixture x(k).
Step 2: Compute the convex hull H of X.
Step 3: H consists of n edge pairs {e_i, e'_i}, with e_i parallel to e'_i, i = 1, ..., n. The number of sources is n.
Step 4: The mixing vectors are h_i = e_i / d, up to an unknown ordering and sign.

Of course, the original order and sign of the vectors are irretrievable. As we have seen, this is a general, problem-inherent limitation, not one specific to this (or any other) particular method. In fact, the limitation cannot be overcome without additional information regarding the sources or the mixing operators.

12.2.1.3 More Than Two Mixtures

It is not difficult to see that the whole convex hull idea extends to the case m ≥ 3. Again, the edges of the convex hull are parallel to the mixing vectors h_i, except that the convex hull now lies in R^m. The algorithms for computing convex hulls in m-dimensional spaces are not as simple as those for the 2D case; for a comprehensive discussion of this topic, see Preparata and Shamos (1985).

12.2.2 Instantaneous Mixtures of M-ary Alphabet Sources

The results of section 12.2.1 can be easily extended to M-ary signals, i.e., signals whose alphabet contains M discrete, equally spaced values. For example, the alphabet A_5 = {−1, −1/2, 0, 1/2, 1} contains M = 5 symbols symmetrically distributed around 0. Similar results to those of the binary case hold here as well: again, the convex hull directly connects the constellation geometry with the unknown mixing vectors. Let d be the distance between the maximum and minimum symbols in the M-ary alphabet A_M,

d = max{A_M} − min{A_M}.

Also, let H be the convex hull of the constellation X of the mixture x(k) = h_1 s_1(k) + ... + h_n s_n(k). Then the following statements are true (see fig. 12.7):

1. The number of cluster points is M^n, where n is the number of M-ary sources.
2. The data constellation is a symmetric, self-repetitive figure.


3. Every edge e of H is parallel to some mixing vector hi , i ∈ {0, 1, · · · , n}, and e has length d hi . 4. Every vector hi corresponds to a pair of edges, i.e., it is parallel to two edges ei and ei of equal length d hi . It follows that H has 2n edges. 5. H is symmetric. For alphabets symmetric around zero the center of symmetry is xO = 0. We may use algorithm 12.2 without modiﬁcations for the solution of the M -ary case as well. 12.2.3

12.2.3 Noisy Data

The analysis of the previous subsections pertains to systems with noiseless outputs. In most applications, however, the observation is burdened with noise, either because the system itself is noisy or because the receiving device introduces errors in the measurements. The additive noise model is commonly used for describing the observation error:

    x(k) = Hs(k) + v(k).    (12.30)

Without loss of generality, and for the sake of visualization, we shall focus on the two-output case; an entirely similar discussion holds for m = 1 or m > 2. The vector signal v(k) = [v_1(k), ..., v_m(k)]^T contains the noise components v_i(k) for each observed output signal i = 1, ..., m. The constellation of x is now less crisp, since the true centers are surrounded by a cloud of points (fig. 12.8a). The methods presented in sections 12.2.1 and 12.2.2 can still be applied, preceded by a clustering process that estimates the actual centers from the noisy data cloud. Such clustering methods include the ISODATA or K-means algorithm (Duda et al., 2001; Lloyd, 1982; MacQueen, 1967), the EM algorithm (Dempster et al., 1977), the neural gas algorithm (Martinetz et al., 1993), Kohonen's self-organizing feature maps (SOM) (Kohonen, 1989), RBF neural networks (Moody and Darken, 1989), and many others. For a detailed treatment of clustering methods refer to Theodoridis and Koutroubas (1998). Figure 12.8b shows the estimation of the true centers using the K-means algorithm in a system with three binary inputs and two linear output mixtures with noise power at 15 dB. Notice that the estimation errors inside the convex hull do not affect the results; only the errors at the boundary are significant. Applying the blind identification method discussed earlier in this section to the estimated centers provided by K-means, we obtain the results shown in table 12.1.

12.2.4 Convolutive Mixtures of Binary Sources

Blind Signal Processing Based on Data Geometric Properties

The convolutive mixtures of binary sources are described by the output of the MIMO FIR system (eq. 12.1). The blind problems related to such systems are considerably more difficult than the corresponding instantaneous mixture problems but, at the same time, they are much more important. Convolutive mixing models, for example, can describe multipath and crosstalk phenomena in wireless communications, and are in that sense much more realistic than instantaneous models. In this section we shall approach the blind source separation and blind system identification problems of MIMO FIR models using the geometric properties of the data constellation. We shall treat, first, the simpler single-input single-output (SISO) problem and then continue on to the multi-input single-output (MISO) case. The proper MIMO problem is not explicitly discussed since it can be seen as a multitude of m decoupled MISO problems.

Figure 12.7 Convex hulls of mixture constellations from n M-ary sources (M = 5). The source symbols are drawn from the alphabet {−1, −0.5, 0, 0.5, 1}, with maximum distance d = 2. (a) n = 1, (b) n = 2, (c) n = 3.

Figure 12.8 (a) Data constellation for a noisy memoryless linear system with 3 binary inputs and 2 outputs (mixtures). The noise level is 15 dB. Superimposed are the true cluster centers marked with "o". (b) True cluster centers (o) and estimated cluster centers (x) using the K-means algorithm. Also shown are the true convex hull (solid line) and the estimated convex hull (dashed line).

Table 12.1 True and Estimated Mixing Vectors

      h1        ĥ1        h2        ĥ2        h3        ĥ3
    0.3000    0.3017   −0.1000   −0.0960   −0.4000    0.3917
    0.5000    0.4902    0.6000    0.6089   −0.1000    0.1009

12.2.4.1 Blind SISO Deconvolution as Instantaneous Blind Source Separation

In this subsection we shall use the results of the previous sections to solve the blind SISO identification and deconvolution problems. Our approach is to relate any given SISO system to an overdetermined instantaneous mixtures model, hence the same methods can be applied as in section 12.2.1. Let us consider a linear, FIR, single-input single-output (SISO) system with a binary antipodal input s(k),

    x(k) = Σ_{i=0}^{L−1} h_i s(k−i).    (12.31)

We shall assume that the impulse response h_i, i = 0, ..., L−1, is real. Let us create a vector sequence x(k) by time-windowing of length m on the output sequence x(k):

    x(k) = [x(k), ..., x(k−m+1)]^T.    (12.32)

Then, using system 12.31, we have

    x(k) = H s(k),    (12.33)

where H is the m × (m+L−1) Toeplitz system matrix

    H = [ h_0   h_1   ...   h_{L−1}   0        ...   0
          0     h_0   ...   h_{L−2}   h_{L−1}  ...   0
          ...
          0     0     ...   h_0       ...      h_{L−2}   h_{L−1} ]    (12.34)

and

    s(k) = [s(k), s(k−1), ..., s(k−m−L+2)]^T.    (12.35)

Now, equation 12.33 describes m linear instantaneous mixtures x_i(k) = x(k−i), i = 0, ..., m−1, of n sources s_j(k) defined as follows:

    s_j(k) = s(k−j+1),    j = 1, 2, ..., n = m+L−1.
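The windowed formulation can be checked numerically. The following sketch builds the Toeplitz matrix of eq. 12.34 and verifies that the windowed output equals H s(k); the filter taps, window length, and seed are arbitrary assumptions of this illustration.

```python
import random

def toeplitz_system(h, m):
    # the m x (m+L-1) matrix of eq. 12.34: row i carries h shifted i columns right
    L = len(h)
    return [[h[j-i] if 0 <= j-i < L else 0.0 for j in range(m + L - 1)]
            for i in range(m)]

random.seed(2)
h = [-0.4937, -1.1330, 0.7632, 0.1604]        # an assumed length-4 impulse response
m, L = 2, len(h)
s = [random.choice((-1, 1)) for _ in range(50)]
x = [sum(h[i]*s[t-i] for i in range(L)) for t in range(L-1, len(s))]
H = toeplitz_system(h, m)
k = 10                                        # x[k] corresponds to time t = k+L-1
xk = [x[k], x[k-1]]                           # windowed output [x(t), x(t-1)]^T
sk = [s[k+L-1-j] for j in range(m+L-1)]       # [s(t), ..., s(t-m-L+2)]^T
Hs = [sum(H[i][j]*sk[j] for j in range(m+L-1)) for i in range(m)]
assert all(abs(xk[i] - Hs[i]) < 1e-12 for i in range(m))
```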

Thus, we have successfully transformed the problem into the same form treated in section 12.2.1:

    x_i(k) = Σ_{j=1}^{n} h_{ij} s_j(k),    (12.36)

where h_{ij} is the (i,j)th element of H. Equivalently, we can write

    x(k) = Σ_{j=1}^{n} h_j s_j(k),    (12.37)

where the mixing vectors h_1, ..., h_n are the columns of H. Given the above formulation, the results of section 12.2.1 apply directly to this problem. There are, however, some special points to be noted:
1. For any nontrivial FIR filter of length L > 1, the number of observations x_1, ..., x_m is necessarily less than the number of sources s_1, ..., s_n, since n = m+L−1 > m.


2. The mixing vectors do not have arbitrary form. For example, h_1 has the form [×, 0, ..., 0]^T and h_n has the form [0, ..., 0, ×]^T.
3. The sources are not independent. In fact, any one is a shifted version of any other.

Next, we shall give examples for two cases: m = 1 and m = 2.

Example 12.3 Time Window of Length m = 1
Suppose that we observe the output x(k) of a SISO filter h = [−0.4937, −1.1330, 0.7632, 0.1604]^T excited by the binary input s(k). Using algorithm 12.1 we shall identify the filter with the necessary permutation and sign changes so that the estimated taps will be positive and arranged in decreasing order. Thus we shall obtain ĥ = [1.1330, 0.7632, 0.4937, 0.1604]^T and so

    x(k) = ĥ_1 ŝ_1(k) + ĥ_2 ŝ_2(k) + ĥ_3 ŝ_3(k) + ĥ_4 ŝ_4(k)
         = (−h_2)(−s_2(k)) + h_3 s_3(k) + (−h_1)(−s_1(k)) + h_4 s_4(k).

Obviously, the estimated sources ŝ_i correspond to the true "sources" s_i as follows: ŝ_1(k) = −s_2(k) = −s(k−1), ŝ_2(k) = s_3(k) = s(k−2), ŝ_3(k) = −s_1(k) = −s(k), ŝ_4(k) = s_4(k) = s(k−3). Since the signals ŝ_i are shifted versions of the original source s(k), it is easy to recover their correct order and relative sign changes by computing, for each signal, the time shift with maximum correlation to an arbitrary reference, for example ŝ_1. Applying the same ordering and sign changes to ĥ, we obtain ±h.

Example 12.4 Time Window of Length m = 2
Consider the same SISO filter as before and let us use time-windowing of length m = 2 to obtain the vector sequence x(k):

    x(k) = [x(k), x(k−1)]^T
         = [ −0.4937  −1.1330   0.7632   0.1604   0
              0       −0.4937  −1.1330   0.7632   0.1604 ] [s(k), s(k−1), s(k−2), s(k−3), s(k−4)]^T.

Using algorithm 12.2 we estimate the original mixing vectors h_1 = [−0.4937, 0]^T, h_2 = [−1.1330, −0.4937]^T, h_3 = [0.7632, −1.1330]^T, h_4 = [0.1604, 0.7632]^T, h_5 = [0, 0.1604]^T, but with an arbitrary order and sign change.
The estimated mixing vectors can be put in the correct order by observing that the true system parameters


satisfy the following: h_{1,1} = h_{2,2} = −0.4937, h_{2,1} = h_{3,2} = −1.1330, h_{3,1} = h_{4,2} = 0.7632, h_{4,1} = h_{5,2} = 0.1604, h_{5,1} = h_{1,2} = 0. Since the sign of each estimated vector is arbitrary, we compare the absolute values |ĥ_{i,1}| against |ĥ_{j,2}| and change the signs of either ĥ_i or ĥ_j, as necessary, so that ĥ_{i,1} = ĥ_{j,2}. Once the correct order of the mixing vectors is retrieved we automatically obtain the correct filter impulse response (up to a sign). Subsequently, the system input s(k) is retrieved using standard (nonblind) deconvolution methods.
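The reordering trick in example 12.3 — aligning each estimated signal to a reference by the time shift of maximum correlation — can be sketched as follows. Sequence length, lag range, and seed are arbitrary assumptions of this illustration.

```python
import random

def align(ref, sig, max_lag):
    # return (lag, sign) maximizing |correlation| of sig shifted by lag against ref
    best = (0, 1, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        c = sum(ref[t]*sig[t-lag] for t in range(max_lag, len(ref)-max_lag))
        if abs(c) > best[2]:
            best = (lag, 1 if c > 0 else -1, abs(c))
    return best[0], best[1]

random.seed(3)
s = [random.choice((-1, 1)) for _ in range(300)]
# the shifted/sign-flipped "estimated sources" of example 12.3
shat1 = [-s[k-1] for k in range(4, len(s))]   # -s(k-1), used as the reference
shat2 = [s[k-2] for k in range(4, len(s))]    # s(k-2)
shat3 = [-s[k] for k in range(4, len(s))]     # -s(k)
assert align(shat1, shat1, 3) == (0, 1)
assert align(shat1, shat3, 3) == (1, 1)       # shat3 matches the reference at lag 1
assert align(shat1, shat2, 3) == (-1, -1)     # shat2 matches at lag -1, sign flipped
```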

12.2.4.2 Blind SISO Identification

An alternative approach for identifying the impulse response h = [h_0, ..., h_{L−1}]^T of a general SISO system (eq. 12.31) has been proposed by Yellin and Porat (1993). The method is not based on constellation geometry but rather on the properties of the successor values of "equivalent" observations. The source symbols s(k) may be drawn from an M-ary alphabet A_M = {±1, ..., ±(M/2)} (M is even). Before we proceed we need to introduce the concept of equivalence between two observations:

Definition 12.1 Observation Equivalence
Two observations x(k) and x(l) are said to be equivalent if the input values that produce them according to 12.31 are identical: s(k−i) = s(l−i), for all i = 0, ..., L−1.

Note that two equivalent observations are necessarily equal, but the converse may not be true. Indeed, it is possible that two equal observations x(k) = x(l) are produced by two different strings of input symbols [s(k), ..., s(k−L+1)] ≠ [s(l), ..., s(l−L+1)].

Consider four sets of N+1 consecutive observations from equation 12.31: X_j = {x(j), x(j+1), ..., x(j+N)}, X_k = {x(k), x(k+1), ..., x(k+N)}, X_l = {x(l), x(l+1), ..., x(l+N)}, X_m = {x(m), x(m+1), ..., x(m+N)}. Further assume that the pairs {x(j), x(k)} and {x(l), x(m)} are equivalent. Define

    σ_{jki} = [s(j+i) − s(k+i)]/2,
    σ_{lmi} = [s(l+i) − s(m+i)]/2,    i = 1, ..., N,    (12.38)

and note that σ_{jki}, σ_{lmi} ∈ A⁰_M = A_M ∪ {0}. Let the following conditions be true:


1. σ_{jk1} and σ_{lm1} are nonzero and coprime, i.e., their greatest common divisor is 1;
2. for all α, β ∈ A_M,

    α/β = σ_{jk1}/σ_{lm1}  ⇒  |α| = |σ_{jk1}|, |β| = |σ_{lm1}|;

3. for all α, β ∈ A⁰_M,

    σ_{jk1}/σ_{lm1} ≠ (σ_{jki} − α)/(σ_{lmi} − β),    for all i = 2, ..., N.

The method starts by identifying the first filter tap h_0 up to a sign, and continues by recursively identifying the remaining taps given the previous ones. Begin with the remark that x(j) and x(k) are equivalent, so [s(j), ..., s(j−L+1)] = [s(k), ..., s(k−L+1)]. Then the successor values of x(j), x(k) can be written as

    x(j+1) = h_0 s(j+1) + Σ_{i=1}^{L−1} h_i s(j+1−i),
    x(k+1) = h_0 s(k+1) + Σ_{i=1}^{L−1} h_i s(k+1−i),

so

    [x(j+1) − x(k+1)]/2 = σ_{jk1} h_0.    (12.39)

Similarly, for x(l+1), x(m+1):

    [x(l+1) − x(m+1)]/2 = σ_{lm1} h_0,    (12.40)

and so

    [x(j+1) − x(k+1)] / [x(l+1) − x(m+1)] = σ_{jk1}/σ_{lm1}.    (12.41)

By condition 2, the ratio |σ_{jk1}/σ_{lm1}| is produced by a unique numerator-denominator pair in A_M. Thus both values σ_{jk1} and σ_{lm1} can be uniquely identified, up to a sign, leading to the magnitude estimation of h_0 by

    |h_0| = |x(j+1) − x(k+1)| / (2|σ_{jk1}|) = |x(l+1) − x(m+1)| / (2|σ_{lm1}|).    (12.42)

Without loss of generality, we may assume that h_0 > 0, and proceed to the estimation of h_1 as follows. Write the second successors of x(j), x(k) as

    x(j+2) = h_0 s(j+2) + h_1 s(j+1) + Σ_{i=2}^{L−1} h_i s(j+2−i),
    x(k+2) = h_0 s(k+2) + h_1 s(k+1) + Σ_{i=2}^{L−1} h_i s(k+2−i),

hence

    [x(j+2) − x(k+2)]/2 = σ_{jk2} h_0 + σ_{jk1} h_1.    (12.43)

Similarly,

    [x(l+2) − x(m+2)]/2 = σ_{lm2} h_0 + σ_{lm1} h_1.    (12.44)

The pair of equations 12.43 and 12.44 involves three unknowns: σ_{jk2}, σ_{lm2}, h_1. However, it turns out that since the first two unknowns come from the discrete set A⁰_M and condition 3 holds, the solution is unique. Indeed, assume there existed two different solutions {σ_{jk2}^{(1)}, σ_{lm2}^{(1)}, h_1^{(1)}} and {σ_{jk2}^{(2)}, σ_{lm2}^{(2)}, h_1^{(2)}}. Then by equations 12.43 and 12.44 we have

    (σ_{jk2}^{(2)} − σ_{jk2}^{(1)}) h_0 = (h_1^{(1)} − h_1^{(2)}) σ_{jk1},    (12.45)
    (σ_{lm2}^{(2)} − σ_{lm2}^{(1)}) h_0 = (h_1^{(1)} − h_1^{(2)}) σ_{lm1}.    (12.46)

Thus,

    (σ_{jk2}^{(2)} − σ_{jk2}^{(1)}) / (σ_{lm2}^{(2)} − σ_{lm2}^{(1)}) = σ_{jk1}/σ_{lm1},

which is impossible, according to condition 3. Therefore, there exists a unique solution to equations 12.43 and 12.44. From these equations it follows that

    h_1 = ([x(j+2) − x(k+2)]/2 − σ_{jk2} h_0) / σ_{jk1} = ([x(l+2) − x(m+2)]/2 − σ_{lm2} h_0) / σ_{lm1},

so the unique h_1 can be obtained by finding the intersection of the sets

    F_1 = { [x(j+2) − x(k+2)]/(2σ_{jk1}) + α h_0/σ_{jk1} ; α ∈ A⁰_M },
    F_2 = { [x(l+2) − x(m+2)]/(2σ_{lm1}) + β h_0/σ_{lm1} ; β ∈ A⁰_M }.

This is computationally trivial since the two sets are finite with few elements. Inductively, for h_i, i ≥ 2, and given the values for h_0, ..., h_{i−1}, we form the


"deflated" successors

    x̄(j+i+1) = x(j+i+1) − Σ_{p=1}^{i−1} σ_{jk(p+1)} h_{i−p},    (12.47)
    x̄(k+i+1) = x(k+i+1) + Σ_{p=1}^{i−1} σ_{jk(p+1)} h_{i−p},    (12.48)
    x̄(l+i+1) = x(l+i+1) − Σ_{p=1}^{i−1} σ_{lm(p+1)} h_{i−p},    (12.49)
    x̄(m+i+1) = x(m+i+1) + Σ_{p=1}^{i−1} σ_{lm(p+1)} h_{i−p},    (12.50)

and we obtain a pair of equations similar to 12.43 and 12.44:

    [x̄(j+i+1) − x̄(k+i+1)]/2 = σ_{jk(i+1)} h_0 + σ_{jk1} h_i,    (12.51)
    [x̄(l+i+1) − x̄(m+i+1)]/2 = σ_{lm(i+1)} h_0 + σ_{lm1} h_i,    (12.52)

which are solved in a similar fashion, producing the unknown tap h_i. Thus, the whole approach is summarized in the following algorithm:

Algorithm 12.3 Yellin and Porat
Step 1: Collect T observation measurements.
Step 2: Find pairs of equivalent measurements. Estimate h_0 according to equation 12.42.
Step 3: Estimate h_1 using h_0 and the pairs of equivalent observations.
Step 4: Continue with the estimation of h_2, ..., h_{L−1}, given the previous estimates.
Step 5: Use the estimated impulse response to deconvolve the observation sequence and obtain the system input.

Remark The choice of pairs of equivalent observations (step 2 in algorithm 12.3) is far from trivial. The indices j, k, l, m must satisfy various constraints so that the assumptions of the method are met. First, according to condition 1, we must have σ_{jk1}, σ_{lm1} ≠ 0, implying that x(j+1) ≠ x(k+1) and x(l+1) ≠ x(m+1). Second, according to condition 3, for all i = 2, ..., N the ratios σ_{jki}/σ_{lmi} should not be equal to σ_{jk1}/σ_{lm1}. A thorough discussion of the implementation details is given in the original paper (Yellin and Porat, 1993). The method can easily be extended to handle complex input constellations (such as QAM) and/or complex filter taps. For the special case of i.i.d. input signals it is estimated that a sufficient batch size that guarantees E > 2 equivalent pairs of measurements is T = 2.44 E^{0.61} N M^{N/2}.


It is difficult to satisfy condition 3 if the source alphabet is binary (A_M = A_a), because there is a limited choice for the values of σ_{jki}, σ_{lmi}, which belong to the set A⁰_a = {−1, 0, 1}.

12.2.4.3 MISO Systems: Direct Source Extraction

The blind extraction of the sources directly from the output of a multi-input single-output (MISO) system is treated in Diamantaras and Papadimitriou (2005). This work is an extension of earlier work on SISO systems (Diamantaras and Papadimitriou, 2004a). The key to the approach in both cases is the structure of the successor values of equivalent observations, induced by the fact that the sources are binary. Subsequently, we shall present the results for the more general MISO case. Let us consider a multi-input single-output (MISO) model described by the following equation:

    x(k) = Σ_{i=0}^{L−1} h_i^T s(k−i),    (12.53)

where h_i, i = 0, ..., L−1, are a set of unknown real n-dimensional mixing vectors or filter taps. The source vector signal s(k) = [s_1(k), ..., s_n(k)]^T is composed of n independent binary antipodal signals: s_i(k) ∈ A_a. The observations of the mixtures are real-valued scalars. For each k, the vector s(k) can take one of 2^n values denoted b_i^{(n)}, i = 1, ..., 2^n. The vector b_i^{(n)T} is the ith row of the matrix B^{(n)} defined in equation 12.17.

Let us extend the concept of observation equivalence, defined before for SISO systems, to MISO systems by simply replacing the scalar inputs with vector inputs. Each observation x(k) is generated by the linear combination of L n-dimensional source vectors; therefore, the observation space X ∋ x(k) is a discrete set consisting of, at most, 2^M elements, M = nL. The cardinality |X| will be less than 2^M if and only if there exist two different L-tuples {b_{j_0}^{(n)}, ..., b_{j_{L−1}}^{(n)}} and {b_{l_0}^{(n)}, ..., b_{l_{L−1}}^{(n)}} of binary vectors such that Σ_{i=0}^{L−1} h_i^T b_{j_i}^{(n)} = Σ_{i=0}^{L−1} h_i^T b_{l_i}^{(n)}. The following assumption avoids this situation:

Assumption 12.1 Two observations x(k), x(l) are equivalent if and only if they are equal.

Hence |X| = 2^M. In other words, to every observation value r ∈ X corresponds a unique L-tuple {b̄_0(r), ..., b̄_{L−1}(r)} of consecutive source vectors that generates this observation. No other observation value r′ ∈ X corresponds to the same L-tuple of binary vectors. For any x(k) = r, we have

    x(k) = Σ_{i=0}^{L−1} h_i^T b̄_i(r),    (12.54)

since, by definition,

    b̄_i(r) = s(k−i),    for i = 0, ..., L−1.


Now the successor observation x(k+1) can be written as

    x(k+1) = h_0^T s(k+1) + Σ_{i=1}^{L−1} h_i^T s(k−(i−1))
           = h_0^T s(k+1) + Σ_{i=1}^{L−1} h_i^T b̄_{i−1}(r).    (12.55)

Since s(k+1) is an n-dimensional binary antipodal vector, x(k+1) can take one of the following 2^n possible values:

    y_p(r) = h_0^T b_p^{(n)} + Σ_{i=1}^{L−1} h_i^T b̄_{i−1}(r),    p = 1, ..., 2^n.    (12.56)

Note that the successor values y_p(r) do not depend on the specific time index k but only on the observation value r. Therefore, each observation value r creates a class of successors Y(r) with cardinality |Y(r)| = 2^n. Furthermore, we have Σ_{p=1}^{2^n} b_p^{(n)} = 0, so the mean ȳ(r) of the members of Y(r) is

    ȳ(r) = (1/2^n) Σ_{p=1}^{2^n} y_p(r)
         = (1/2^n) [ h_0^T Σ_{p=1}^{2^n} b_p^{(n)} + 2^n Σ_{i=1}^{L−1} h_i^T b̄_{i−1}(r) ]
         = Σ_{i=1}^{L−1} h_i^T b̄_{i−1}(r).    (12.57)

Now, let us replace every x(k) = r by the mean ȳ(r) to obtain a new sequence x^{(2)}(k):

    x^{(2)}(k) = Σ_{i=1}^{L−1} h_i^T b̄_{i−1}(r) = Σ_{i=1}^{L−1} h_i^T s(k−i+1).    (12.58)

The new MISO system 12.58 has the same taps as the original system 12.53 except that it is shorter, since h_0 is missing. We will say that the system has been deflated. An additional but trivial difference is that the source sequence is time-shifted. Based on the discussion above, the whole filter- or system-deflation method is summarized as follows:

Algorithm 12.4 System Deflation
Step 1: For every r ∈ X locate the set of time instances K(r) = {k : x(k) = r}.
Step 2: Find the successor set Y(r) = {x(k+1) : k ∈ K(r)}. This set must contain 2^n distinct values y_1(r), ..., y_{2^n}(r).
Step 3: Compute the mean ȳ(r) = (1/2^n) Σ_{i=1}^{2^n} y_i(r).
Step 4: Replace x(k) by ȳ(r), for all k ∈ K(r).
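Algorithm 12.4 can be sketched in a few lines of Python. The example below deflates a two-input system with L = 2 and checks that the deflated sequence obeys the shortened model of eq. 12.58; the taps, sequence length, and seed are assumptions of this illustration.

```python
import random

def deflate(x, n):
    # one pass of algorithm 12.4: replace each sample by the mean of the
    # 2^n distinct successor values of its observation value
    succ = {}
    for k in range(len(x) - 1):
        succ.setdefault(round(x[k], 9), set()).add(round(x[k + 1], 9))
    means = {r: sum(v)/len(v) for r, v in succ.items() if len(v) == 2**n}
    return [means[round(v, 9)] for v in x[:-1]]

random.seed(1)
h0, h1 = (0.93, -0.41), (0.27, 0.65)          # assumed taps of a 2-input FIR system
s = [(random.choice((-1, 1)), random.choice((-1, 1))) for _ in range(2000)]
x = [h0[0]*s[k][0] + h0[1]*s[k][1] + h1[0]*s[k-1][0] + h1[1]*s[k-1][1]
     for k in range(1, len(s))]
x2 = deflate(x, n=2)
# the deflated sequence obeys the shortened system: x2(k) = h1^T s(k)
assert all(abs(x2[j] - (h1[0]*s[j+1][0] + h1[1]*s[j+1][1])) < 1e-6
           for j in range(len(x2)))
```

Note the assumed taps are chosen so that all 2^{nL} = 16 observation values are distinct, which plays the role of assumption 12.1 here.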


Clearly, for this method it is essential that all observation-successor pairs [r, y_i(r)], i = 1, ..., 2^n, appear at least once in the output sequence x. Applying the deflation method L−1 times, the system is eventually reduced to a multi-input single-output instantaneous problem:

    x^{(L)}(k) = h_{L−1}^T s(k−L+1).    (12.59)

The BSS problem of the type in 12.59 has been treated in section 12.2.1. The main disadvantage of this method stems from the assumption that the data set must contain every possible observation-successor pair. As the size of the MISO system increases, this assumption requires exponentially larger observation data sets. An alternative approach starts by observing that for any r ∈ X the centered successors,

    c_i = y_i(r) − ȳ(r) = h_0^T b_i^{(n)},    i = 1, ..., 2^n,    (12.60)

are independent of r. Thus every observation has the same set of centered successors. We shall refer to the set C = {c_i ; i = 1, ..., 2^n} as the centered successor constellation set of system 12.53. C can be easily computed by first obtaining Y(r), for any r, and then subtracting the mean ȳ(r) from each element y_i(r) ∈ Y(r). Note that C is symmetric in the sense that c ∈ C ⇔ −c ∈ C. Now, for every observation value r = x(k) ∈ X we have

    x(k) = h_0^T s(k) + Σ_{l=1}^{L−1} h_l^T s(k−l),    (12.61)

so

    r = h_0^T b_i^{(n)} + Σ_{l=1}^{L−1} h_l^T b̄_l(r),    for some i,    (12.62)
      = c_i + Σ_{l=1}^{L−1} h_l^T b̄_l(r),    for some i.    (12.63)

Furthermore, due to the symmetry of the constellation set, there exists a "dual" observation value r^d ∈ X such that

    r^d = −c_i + Σ_{l=1}^{L−1} h_l^T b̄_l(r),    (12.64)

i.e.,

    r^d = r − 2c_i.    (12.65)

Assume that for every observation r ∈ X there exists a unique index j ∈ {1, ..., 2^n} such that r − 2c_j ∈ X. Then the dual value r^d can be identified by testing all r − 2c_j, j = 1, ..., 2^n, for membership in the observation space X. Let us now replace x(k) by the average of r and r^d, to obtain

    x̃^{(2)}(k) = (r + r^d)/2 = r − c_j = Σ_{l=1}^{L−1} h_l^T b̄_l(r).    (12.66)


Note that b̄_l(r) = s(k−l), so

    x̃^{(2)}(k) = Σ_{l=1}^{L−1} h_l^T s(k−l).    (12.67)

Equation 12.67 describes a new, shortened MISO system. The approach relies on the following assumption:

Assumption 12.2 For at least one r_0 ∈ X there exist time instances k_i ∈ {1, 2, ..., K}, i = 1, ..., 2^n, such that x(k_i) = r_0 and x(k_i + 1) = σ_i(r_0), i = 1, ..., 2^n. In addition, every possible value of X appears at least once in the data set.

Summarizing the above results, our second method for obtaining the deflated system 12.67 is described below:

Algorithm 12.5 System Deflation 2
Step 1: Locate an observation value r_0 for which 2^n distinct successors σ_i(r_0), i = 1, ..., 2^n, exist in the data set.
Step 2: Compute the successor constellation set C according to equation 12.60.
Step 3: For every observation r = x(k) find the (unique) value j for which r − 2c_j ∈ X.
Step 4: Replace x(k) by r − c_j.

Again, repeating this algorithm L−1 times reduces the system to a memoryless one,

    x̃^{(L)}(k) = h_{L−1}^T s(k),    (12.68)

which can be treated as described in the previous section on MIMO systems.

Example 12.5 MISO System Identification and Source Separation
We shall demonstrate the application of the second method via a specific example. Assume that we observe the output x(k) of a two-input one-output system (fig. 12.9a). The system has two binary inputs s_1, s_2, convolution length L = 3, and filter taps h_1 = [−0.9024, 1.5464]^T, h_2 = [−0.6131, 0.7166]^T, h_3 = [−0.4131, −0.1621]^T. The output constellation contains 2^{nL} = 64 clusters: X = {±4.3537, ±4.0295, ±3.5275, ...}. Already the first value, r = −4.3537, has 2^n = 4 distinct successors in the output sequence x(k). From those successor values the centered successor constellation set is easily computed to be C = {−2.4488, −0.6440, 0.6440, 2.4488}. After the deflation steps 2 and 3 we obtain a new sequence x^{(2)}(k) (fig. 12.9b). Now the output constellation contains 2^{n(L−1)} = 16 clusters: X^{(2)} = {±1.9049, ±1.5807, ±1.0787, ...}, and the centered successor constellation set is C^{(2)} = {−1.3297, −0.1035, 0.1035, 1.3297}. We use this set to obtain a second deflated signal x^{(3)}(k) (fig. 12.9c). This signal actually corresponds to an instantaneous mixture of the two sources. The output constellation has only four clusters: X^{(3)} = {−0.5752, −0.2510, 0.2510, 0.5752}. We may apply algorithm 12.1 to obtain an estimate of the mixing parameters and of the input signals as well. We obtain ĥ_{3,1} = 0.4131 = −h_{3,1} and ĥ_{3,2} = 0.1621 = −h_{3,2}. Subsequently performing the optimization (eq. 12.15) for the estimation of the sources, we get perfect reconstruction (except for the sign): ŝ_1(k) = −s_1(k), ŝ_2(k) = −s_2(k).

Figure 12.9 (Top) Output signal from a two-input-one-output FIR system of length L = 3. The output constellation contains 2^{nL} = 64 distinct clusters. (Middle) First deflated signal with 2^{n(L−1)} = 16 clusters. (Bottom) Second deflated signal with 2^{n(L−2)} = 4 clusters. The last signal corresponds to an instantaneous mixture of the two sources.

12.3 Continuous Sources

In section 12.2 we exploited the constellation structure of signals generated by linear systems with finite alphabet inputs. In many applications, however, the range of values of the source data is continuous. In this case the geometric properties of the signals can still be exploited to derive efficient deterministic blind separation methods, provided that the sources are sparse, or the input distribution is bounded, or the number of observations is m = 2.


12.3.1 Early Approaches: Two Mixtures, Two Sources

It is possible to generalize the geometric properties of binary signals described in section 12.2.1 to the case where the source symbols are bounded. We start with the simplest case of two instantaneous mixtures x_1, x_2 and two sources s_1, s_2 (m = n = 2):

    x(k) = [x_1(k), x_2(k)]^T = h_1 s_1(k) + h_2 s_2(k).    (12.69)

We shall describe two of the earliest and most characteristic methods, by Puntonet et al. (1995) and Mansour et al. (2001).

The Method of Puntonet et al. (1995)
The geometry of mixtures of binary signals bears similarity to the geometry of mixtures of bounded sources. Consider the mixing model 12.69 and let s_1(k), s_2(k) ∈ [−B, B]. The linear operation of equation 12.69 transforms the original square source constellation (fig. 12.10a) into a parallelogram-shaped constellation with edges parallel to the vectors h_1 and h_2 (fig. 12.10b). The blind identification task is then equivalent to finding the edges of the convex hull of the output constellation. Puntonet et al. (1995) proposed a simple procedure for doing that, composed of two steps:

Step 1: Locate the outmost corner x_O of the parallelogram by finding the observation with the maximum norm: x_O = x(k_0), k_0 = arg max_k ‖x(k)‖².
Step 2: Translate the observations, x′(k) = x(k) − x(k_0), such that x_O becomes the origin, and compute the slopes of the parallelogram edges as the minimum and maximum ratios: r_min = min_k (x′_2(k)/x′_1(k)), r_max = max_k (x′_2(k)/x′_1(k)). These are the ratios h_12/h_11 and h_22/h_21, not necessarily in that order.

Once the slopes of the edges are determined, the mixing matrix is estimated by

    Ĥ = [ 1      1/r_min
          r_max  1      ].    (12.70)

Since (r_min, r_max) = (h_12/h_11, h_22/h_21) or (h_22/h_21, h_12/h_11), we have

    Ĥ = [ 1          h_21/h_22      or    Ĥ = [ 1          h_11/h_12
          h_12/h_11  1         ]                h_22/h_21  1         ].

Remember now that

    H = [h_1, h_2] = [ h_11  h_21
                       h_12  h_22 ],    (12.71)
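A small numerical sketch of the two steps, with an assumed mixing matrix and a grid of source samples standing in for observed data:

```python
# grid of source samples over the square [-1, 1]^2, corners included
H = [[0.9, 0.35], [0.2, 0.8]]      # assumed mixing matrix: columns h1, h2
src = [(i/10.0 - 1.0, j/10.0 - 1.0) for i in range(21) for j in range(21)]
obs = [(H[0][0]*s1 + H[0][1]*s2, H[1][0]*s1 + H[1][1]*s2) for s1, s2 in src]

# Step 1: the outmost corner is the observation of maximum norm
xO = max(obs, key=lambda p: p[0]*p[0] + p[1]*p[1])
# Step 2: translate, then take the extreme slopes of the translated cloud
ratios = [(p[1] - xO[1])/(p[0] - xO[0]) for p in obs if abs(p[0] - xO[0]) > 1e-12]
rmin, rmax = min(ratios), max(ratios)
assert abs(rmin - H[1][0]/H[0][0]) < 1e-9    # slope h12/h11
assert abs(rmax - H[1][1]/H[0][1]) < 1e-9    # slope h22/h21
```

With real data the extremes would only be approximate, since the true corner and edges are hit exactly here only because the grid contains them.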


Figure 12.10 (a) Source constellation for two independent sources uniformly distributed between −1 and 1. (b) Output constellation after a 2×2 linear, memoryless transformation of the sources in panel a.

so

    Ĥ = H [ 1/h_11  0           or    Ĥ = H [ 0       1/h_12
            0       1/h_22 ]                  1/h_21  0      ].

In either case, the source estimate ŝ(k) = Ĥ^{−1} x(k) will be

    ŝ(k) = [h_11 s_1(k), h_22 s_2(k)]^T    or    ŝ(k) = [h_12 s_2(k), h_21 s_1(k)]^T.    (12.72)

Thus, the estimated sources will be equal to the true ones except for the usual unspecified scale and order. Note that the method works even if the source pdf is semibounded, for example, bounded only from below. In that case the parallelogram is open-ended, but the visible corner is sufficient for identifying the two slopes. This approach has two main drawbacks: it cannot generalize to more sources or observations, and it will not work if the source pdf is not bounded (for example, Gaussian, Laplace, etc.).

The Method of Mansour et al. (2001)
Another simple procedure for the solution of the 2×2 instantaneous BSS problem has been proposed by Mansour et al. (2001). The transformation s → x described by equation 12.69 represents a skew, rotation, and scaling of the original axes in two dimensions. The first step of the procedure is to remove the skew by prewhitening x using the covariance matrix R_x = E{x(k)x(k)^T}. If R_x = L_x L_x^T is the Cholesky factorization of R_x, let

    z(k) = L_x^{−1} x(k).    (12.73)


The mapping x → z is called a prewhitening transformation because the output vector z(k) is white: R_z = E{z(k)z(k)^T} = L_x^{−1} R_x L_x^{−T} = I. The prewhitening transformation (eq. 12.73) makes the axes orthogonal again, but the rotation and the scaling remain. The next step is to compensate for the rotation by computing the angle θ of the furthermost point of the constellation of z from the origin. We consider two cases:

The sources are uniformly distributed, say, between −1 and 1 (fig. 12.11a). The source constellation is a square and the angle θ corresponds to a corner of the square. Therefore, in order to compensate for θ, the corner should return to its original position at π/4. This is achieved by the following orthogonal transformation:

    y(k) = [ cos(π/4 − θ)  −sin(π/4 − θ)
             sin(π/4 − θ)   cos(π/4 − θ) ] z(k).    (12.74)

The sources are super-Gaussian, i.e., kurt(s_i) = E[s_i^4] − 3(E[s_i^2])^2 > 0, i = 1, 2 (fig. 12.11b). The constellation of s in this case is "pointy" along the directions [±1, 0] and [0, ±1]. The angle θ corresponds to one of the "hands" of the X-shaped constellation for x. Clearly, θ should be reduced to 0. This is done by the following rotation transformation:

    y(k) = [ cos(−θ)  −sin(−θ)
             sin(−θ)   cos(−θ) ] z(k).    (12.75)

In both cases there remains an unknown scaling of the sources, which cannot be removed since it is unobservable in all BSS problems.
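The prewhitening step of eq. 12.73 can be sketched as follows; the mixing matrix, sample size, and seed are assumptions of this illustration. Because L_x is built from the sample covariance itself, the whitened samples have sample covariance exactly I, up to rounding:

```python
import math
import random

random.seed(7)
H = [[0.9, 0.35], [0.2, 0.8]]                 # assumed 2x2 mixing matrix
x = []
for _ in range(20000):
    s1, s2 = random.uniform(-1, 1), random.uniform(-1, 1)
    x.append((H[0][0]*s1 + H[0][1]*s2, H[1][0]*s1 + H[1][1]*s2))

T = len(x)
m1 = sum(p[0] for p in x)/T; m2 = sum(p[1] for p in x)/T
xc = [(p[0]-m1, p[1]-m2) for p in x]          # centered observations
a = sum(p[0]*p[0] for p in xc)/T              # sample Rx = [[a, c], [c, b]]
b = sum(p[1]*p[1] for p in xc)/T
c = sum(p[0]*p[1] for p in xc)/T
l11 = math.sqrt(a); l21 = c/l11; l22 = math.sqrt(b - l21*l21)   # Rx = Lx Lx^T
z = [(p[0]/l11, (p[1] - l21*(p[0]/l11))/l22) for p in xc]       # z = Lx^{-1} x
v1 = sum(q[0]*q[0] for q in z)/T
v2 = sum(q[1]*q[1] for q in z)/T
cv = sum(q[0]*q[1] for q in z)/T
assert abs(v1 - 1) < 1e-9 and abs(v2 - 1) < 1e-9 and abs(cv) < 1e-9
```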

12.3.2 Sparse Sources, Two Mixtures

Another special case of continuous sources that can be successfully treated using geometric methods is the case of sparse sources. A signal s_i(k) is sparse if it is equal to zero most of the time. The sparseness of s_i is measured by the sparseness probability p_S(s_i) = Pr{s_i(k) = 0}. Values of p_S closer to 1 correspond to sparser data, whereas values closer to 0 represent dense data. Consider now the typical instantaneous mixing model

    x(k) = Hs(k),    (12.76)

assuming that all the sources are sparse. Then it is highly likely that there exist some time instances such that only one source is active at that instance. If, for example, only si is nonzero at time k, then x(k) is proportional to hi , the ith column of H. The number of outputs m is not important, as long as m ≥ 2. In fact, the number of outputs may even be less than the number of sources (m < n). In the subsequent discussion we shall use the convenient value m = 2 because it will

370

Blind Signal Processing Based on Data Geometric Properties

Figure 12.11  The linear, instantaneous transformation s → x introduces skew, rotation, and scaling on the original axes. The whitening transform x → z removes the skew, making the axes orthogonal again. Then the rotation can be removed by an orthogonal transformation z → y. (a) If the source distribution is uniform we must rotate so that θ becomes π/4. (b) If the source distribution is super-Gaussian then we must rotate so that θ becomes 0.

Figure 12.12  Output constellation for m = 2 outputs and n = 4 sparse sources. The top three plots correspond to different sparseness probabilities: (a) pS = 0.6, (b) pS = 0.7, (c) pS = 0.8. The solid lines are the directions of the four column vectors of H. The three bottom plots (d), (e), and (f) are polar plots of the data density (potential) function with spreading parameter σ = 8, corresponding to the constellations (a), (b), and (c), respectively.

help us visualize the results. Boﬁll and Zibulevsky (2001) observed that the data are clustered along the directions of the mixing vectors hi , i.e., the columns of H. Figure 12.12 shows the output constellation for the memoryless system (eq. 12.76) with m = 2 outputs, n = 4 sparse inputs, and diﬀerent sparseness levels. As the sparseness of the inputs increases, the four clustering directions become more easily identiﬁable (see ﬁgures 12.12a,b,c). Thus blind system identiﬁcation is achieved by identifying the directions of maximum data density. Assuming that the sources are zero mean, so they can take both positive and negative values, the clustering will extend to the negative directions −hi as well. Since, for each i, both opposing directions hi and −hi are equally probable, it is not possible to identify the “true” vector. This is a manifestation of the sign ambiguity which is inherent to the BSS problem. Not surprisingly, the ordering ambiguity is also present in the sense that there is no predeﬁned order on the directions of maximum data density.
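The clustering effect is easy to reproduce numerically. The sketch below uses a hypothetical 2 × 4 mixing matrix with unit-norm columns, switches each source off independently with probability pS = 0.8, and reads a clustering direction off a crude histogram of data angles rather than the smoother potential-function estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: n = 4 sparse sources, m = 2 mixtures, pS = Pr{s_i(k) = 0} = 0.8.
n, N, p_s = 4, 20000, 0.8
s = rng.laplace(size=(n, N)) * (rng.random((n, N)) > p_s)
col_angles = np.array([0.3, 0.9, 1.6, 2.4])
H = np.vstack([np.cos(col_angles), np.sin(col_angles)])  # columns h_i, angles in [0, pi)
x = H @ s

# Discard all-zero samples and fold each data angle into [0, pi),
# which identifies the opposing directions h_i and -h_i.
active = np.sum(x ** 2, axis=0) > 1e-12
theta = np.arctan2(x[1, active], x[0, active]) % np.pi

# The angle histogram peaks along the directions of the columns of H.
hist, edges = np.histogram(theta, bins=360, range=(0.0, np.pi))
peak = edges[np.argmax(hist)]
assert np.min(np.abs(col_angles - peak)) < 0.05
```

Folding the angles modulo π is exactly the sign ambiguity mentioned above: the histogram cannot distinguish hi from −hi, and the order of the recovered peaks is arbitrary.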


For m = 2, a practical algorithm has been proposed by Bofill and Zibulevsky (2001). For any two-dimensional vector x = [x1, x2]T ≠ 0, define the angle of x as

θ(x) = arctan(x2/x1).   (12.77)

The directions θ where the random variable θk = θ(x(k)) has the highest density are the directions of the mixing vectors. The density is estimated by the use of the potential function

U(θ) = Σk w(k) t(θ − θk; σ),   (12.78)

where the kernel t is given by

t(α; σ) = 1 − |α| / (π/(4σ))  for |α| ≤ π/(4σ),  and  t(α; σ) = 0  otherwise.

Cases with m ≥ 3 are possible but impractical due to the high computational cost and the extremely large required data sets. In an analogous way to the sparse case, the method is based on the properties of the density (pdf) ρθ̄ of the random variable

θ̄ = θ(x) mod π,

where θ(x) is the angle of x defined in equation 12.77. Here, however, the peaks of the density may not have a one-to-one correspondence with the mixing vectors, especially when the number of sources is greater than the number of observations (n > m). The basic result is that the angles

θi = θ(hi),   i = 1, . . . , n   (12.82)

of the mixing vectors hi satisfy the geometric convergence condition (GCC) defined below.

Definition 12.2 Geometric Convergence Condition  The set of angles {θ1, . . . , θn}, θi ∈ [0, π), satisfies the GCC if, for each i, θi is the median of ρθ̄ restricted to the receptive field Φ(θi).

Definition 12.3 Receptive Field  For a set of angles {θ1, . . . , θn}, θi ∈ [0, π), the receptive field Φ(θi) is the set consisting of the angles θ closest to θi:

Φ(θi) = {θ ∈ [0, π) : |θ − θi| ≤ |θ − θj| for all j ≠ i}.

Since the angles of the true mixing vectors satisfy the GCC, we hope to find them by devising an algorithm that converges when the GCC is satisfied. This is exactly the aim of the geometric ICA algorithm (Theis et al., 2003a,b). This iterative algorithm works with a set of n unit-length vectors (and their opposites) and terminates only when the angles of these vectors are the medians of their corresponding receptive fields. It is conjectured that the only stable points of this algorithm are the true mixing vectors. The algorithm starts by picking n random pairs of opposing vectors {wi(0), w′i(0) = −wi(0)}, i = 1, . . . , n. At each iteration k, a new observation vector x(k) is projected onto the unit circle:

z(k) = x(k) / ‖x(k)‖.

Then we locate the vector wj(k) closest to z(k) and update the pair wj(k), w′j(k) as follows:

wjtemp = wj(k) + η(k) (z(k) − wj(k)) / ‖z(k) − wj(k)‖,
wj(k + 1) = wjtemp / ‖wjtemp‖,
w′j(k + 1) = −wj(k + 1).   (12.83)
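A single step of this update rule can be sketched as follows. The data point, the initial vectors, and the learning rate η are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical state: n = 2 unit vectors w_i, kept together with their opposites -w_i.
w = np.array([[1.0, 0.0],
              [0.0, 1.0]])   # rows are w_1, w_2
eta = 0.05                   # hypothetical learning rate eta(k)

# A new observation, projected onto the unit circle: z = x / ||x||.
x = rng.normal(size=2)
z = x / np.linalg.norm(x)

# Locate the closest vector among {w_i, -w_i}.
cand = np.vstack([w, -w])
j = np.argmin(np.linalg.norm(cand - z, axis=1))
wj = cand[j]

# Update (eq. 12.83): step toward z, renormalize, and mirror the opposite vector.
w_temp = wj + eta * (z - wj) / np.linalg.norm(z - wj)
w_new = w_temp / np.linalg.norm(w_temp)
w_opposite = -w_new

assert np.isclose(np.linalg.norm(w_new), 1.0)
```

Only the winning pair is modified in an iteration; the remaining vectors are left untouched, as described next.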

The other w's are not updated in this iteration. It can be shown that the set W = {w1(∞), . . . , wn(∞)} is a fixed point of this algorithm if and only if the angles θ(w1(∞)), . . . , θ(wn(∞)) satisfy the GCC. We already know that the set A = {θ(h1), . . . , θ(hn)} satisfies the GCC; therefore, we hope that, at convergence, {θ(w1(∞)), . . . , θ(wn(∞))} = A. If this is true, then the vectors w1(∞), . . . , wn(∞) are parallel to the mixing vectors h1, . . . , hn, although not necessarily in that order. Since the order and scale are insignificant, this is not a problem. If m = n, then the estimated matrix Ĥ−1 = [w1(∞), . . . , wn(∞)]−1 solves the BSS problem. In the overcomplete case (n > m) the general algorithm for the source recovery is the maximization of P(s) under the constraint x = Hs. This linear optimization problem can be approached using various methods, such as the one described in section 12.3.2.

The FastGEO Algorithm

An alternative way to find the mixing vectors is to design a function which is zero exactly when its arguments satisfy the GCC. Then we simply have to compute the zeros of this function, for example by exhaustive search. This approach describes the so-called FastGEO algorithm (Jung et al., 2001; Theis et al., 2003a). Let us separate the interval [0, π) into n subintervals with separating boundaries φ1, . . . , φn, and let θi be the median of θ̄ in the subinterval [φi, φi+1]:

θi = Fθ̄−1( (Fθ̄(φi) + Fθ̄(φi+1)) / 2 ),   i = 1, . . . , n,   (12.84)

where Fθ̄ is the cumulative distribution function of θ̄, Fθ̄−1 is the inverse function of Fθ̄ (we assume it exists), and φn+1 = φ1 + π (see fig. 12.14). Then the function

μ(n)(φ1, . . . , φn−1) = [ (θ1 + θ2)/2 − φ2, . . . , (θn−1 + θn)/2 − φn ]T   (12.85)

is zero if and only if

(θi + θi+1)/2 = φi+1,   i = 1, . . . , n − 1,

in which case, by definition, each receptive field Φ(θi) is exactly the subinterval [φi, φi+1] and θi is the median of its receptive field; in other words, the set {θ1, . . . , θn} satisfies the GCC. For each set of separating boundaries {φ1, . . . , φn−1} we compute the medians θ1, . . . , θn by equation 12.84 and then the function μ(n)(φ1, . . . , φn−1) by equation 12.85. The FastGEO algorithm is the exhaustive search for the zeros of μ(n).
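For n = 2 the search is one-dimensional, so the whole procedure can be sketched directly. Empirical medians stand in for Fθ̄−1; the uniform sources, sample size, and grid resolution are hypothetical, and the mixing matrix is the one from example 12.7 below.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: two uniform sources, two mixtures (n = m = 2).
s = rng.uniform(-1.0, 1.0, size=(2, 10000))
H = np.array([[0.0735, 0.2913],
              [-0.3391, 0.3725]])
x = H @ s

theta_bar = np.arctan2(x[1], x[0]) % np.pi   # theta(x) mod pi

def mu2(phi):
    """mu^(2)(phi), with empirical medians in place of the inverse CDF."""
    t = (theta_bar - phi) % np.pi             # shift so the boundaries sit at 0 and pi/2
    th1 = np.median(t[t < np.pi / 2]) + phi   # median of the first receptive field
    th2 = np.median(t[t >= np.pi / 2]) + phi  # median of the second receptive field
    return (th1 + th2) / 2.0 - (phi + np.pi / 2.0)

# Exhaustive search for a zero of mu^(2) over a grid of boundary angles.
grid = np.linspace(0.0, np.pi / 2, 1000, endpoint=False)
phi_star = grid[np.argmin(np.abs([mu2(p) for p in grid]))]
assert abs(mu2(phi_star)) < 0.05
```

Shifting the angles by φ (modulo π) turns the two wrap-around receptive fields into the fixed intervals [0, π/2) and [π/2, π), which makes the medians trivial to compute.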


Figure 12.14  The angles θi of the mixing vectors hi satisfy the geometric convergence condition if they are the medians of the random variable θ(x) within the intervals Φ(θi) = [φi, φi+1]. Φ(θi) is called the receptive field of θi and it is the set consisting of the angles θ closest to θi.

Especially for n = 2, we let φ1 = φ and have φ2 = φ + π/2, so

μ(2)(φ) = (θ1 + θ2)/2 − (φ + π/2).

Example 12.7  Let x1, x2 be two instantaneous mixtures of two uniform sources s1, s2. The mixtures were generated by the mixing operator

H = [  0.0735   0.2913
      −0.3391   0.3725 ].

The distribution of the angle θ(x) is shown in fig. 12.15. The same figure shows the receptive field boundaries {φ1, φ2, φ3} = {77.0998, 167.0998, 257.0998} (in degrees), corresponding to the angles {θ1, θ2} = {51.9759, 102.2237} of the mixing vectors h1 = [0.0735, −0.3391]T and h2 = [0.2913, 0.3725]T. The angles θ2 and θ1 + 180° are the medians of the angle distribution in the corresponding receptive fields.
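These angles can be recomputed directly from the columns of H, as a quick check; the exact column angles agree with the values quoted in the example to within about 0.01 degrees.

```python
import numpy as np

H = np.array([[0.0735, 0.2913],
              [-0.3391, 0.3725]])

# theta(h_i) = arctan(h_2i / h_1i), folded into [0, 180) degrees.
theta = np.degrees(np.arctan2(H[1], H[0])) % 180.0
assert np.allclose(np.sort(theta), [51.9759, 102.2237], atol=0.05)

# Each receptive-field boundary lies halfway between neighboring mixing angles:
# (51.9759 + 102.2237) / 2 = 77.0998, matching the quoted phi_1.
assert np.isclose((51.9759 + 102.2237) / 2.0, 77.0998)
```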

Figure 12.15  The distribution of the angle θ(x) for two instantaneous mixtures of two uniformly distributed sources. The receptive field boundaries are defined by the angles φi. The angles θi of the mixing vectors are the medians of each receptive field.

12.4 Conclusions

Blind signal processing (BSP) refers to a wide variety of problems where the output of a system is observable but neither the system nor the input is known. The large family of BSP problems includes blind signal separation (BSS), blind system or channel identification (BSI or BCI), and blind deconvolution (BD). Traditional approaches exploit statistical properties of second or higher order. Recently a third approach has emerged which uses the geometric properties of the data cloud. This approach exploits the finite-alphabet property of the input data or the shape of the constellation, depending on the probability density of the sources. In such an approach our basic tools are methods for data clustering and shape description, such as the convex hull. The advantage of the geometric approach is the finite nature of the methodology following the clustering step. Typically, this methodology is fast for small problem sizes, i.e., for few sources or short channels. The main disadvantage is the combinatorial explosion incurred when the problem size grows large. To combat this drawback, channel-shortening methods may come to our assistance. The problem, however, is far from solved and many issues remain open.

In this chapter we presented the main geometric principles used in blind signal processing. We gave a comprehensive literature survey of geometric methods and outlined the basic methods for blind source separation, blind deconvolution, and blind channel identification.

13 Game-Theoretic Learning

Geoffrey J. Gordon

Whatever games are played with us, we must play no games with ourselves.
—Ralph Waldo Emerson

one-step vs. sequential

(im)perfect information

A game is a description of an environment where multiple decision makers can interact with one another. Each of the decision makers, called a player, may have its own goals; these goals may align with the goals of other players, conﬂict with them, or some combination of the two. Some traditional examples of games are bridge, blackjack, chess, roulette, and poker. Less traditional examples include auctions, marketing campaigns, decisions about where to build a new factory, and various types of social interactions such as applying for a job. Finally, many popular games mix components of perception and physical skill with the problem of making good decisions; examples include football, freeze tag, paintball, and driving in traﬃc. This chapter is about how to learn to play a game. We will discuss how a player can, by repeated interaction with its environment and with the other players, discover how to make decisions which achieve its goals as reliably as possible. Playing a real-life game such as football is far beyond the capability of any current artiﬁcial learning system, but we will at least begin to address the issues of exploration and generalization which arise in such a problem. The diﬃculty of making good decisions in a game can range from trivial to nearly impossible. We can classify games according to several dimensions; each of these classiﬁcations aﬀects the type of solution we can seek, the algorithms we can use, and the diﬃculty of ﬁnding a good plan of action. In one-step games each player decides on its strategy all at once. In sequential games a player commits to its actions in several steps, and after each step it may ﬁnd out something about the other players’ choices. Of course, to learn about any game the players will usually need to play it several times; so, we can speak of a repeated game, either one-step or sequential. In perfect information games, each player knows all the choices that the other players have already made. 
In games with imperfect information, the players know


only some of the past choices of other players. For example, if an auction house is selling several copies of the same item one after another via sealed bids, the bidders will know the sale prices of the previous items but will not know the details of the previous bids. In complete information games the players know all the details of the game they are playing: they know the structure of the game, the outcomes of all past external events that are relevant to their future payoﬀs, and what the payoﬀs are in any situation for themselves and for the other players. In incomplete information games some of the players are missing some of this information; for example, in bridge or poker the players don’t see each other’s cards. In this chapter we will discuss all of these diﬀerent types of games in turn. Each one presents diﬀerent diﬃculties, so we will discuss various algorithms for learning to play them. Each of the algorithms provides diﬀerent performance guarantees, so we will describe and compare the types of guarantees that are available. We will start with one-step games. The standard representation of one-step games is the normal form, described in section 13.1. Given the normal form, classical game theory looks for distributions of play from which no player can alter its actions to improve its payoﬀs; such distributions are called equilibria, and we will discuss them in section 13.2. Equilibria are possible outcomes of learning, since learning players will not be satisﬁed as long as they think they can improve their reward. So, in section 13.3, we will review learning algorithms for one-step games and use the diﬀerent types of equilibria to describe what happens when various learning algorithms play against one another. From one-step games we will move to sequential games. In sequential games we can deﬁne additional types of equilibria, and we need to move to more complicated learning algorithms. 
Sections 13.4 and 13.5 cover these new equilibria and learning algorithms. Finally, we will conclude in section 13.6 with some examples of how game-theoretic learning algorithms have been applied to solve real-world problems. These problems range from poker to robotic soccer.

13.1 Normal-Form Games

Any game can in principle be described with the following information:

Normal form representation

- A list of the players. We will assume that there are only finitely many of them.
- For each player, a list of the actions (also called plays or pure strategies) that it may choose. We will assume that there are finitely many pure strategies. Any probability distribution over pure strategies is called a mixed strategy.
- Given a strategy profile (that is, a pure strategy for every player), the utility or payoff which each player assigns to the resulting outcome. If there are external random events which affect utility, we only need to know the expected utility of each strategy profile.


This representation is called the normal form of the game. As with any general representation, the normal form may be nowhere near the most concise way to describe a game. Still, it does allow us to discuss many diﬀerent games and algorithms in a general way. We can represent a normal-form game with a table: there is one entry in the table for each strategy proﬁle, and the entry is a vector which lists the utility of the resulting outcome for each player. For example, the following table represents the children’s game rock-paper-scissors:


        R       P       S
R     0, 0   −1, 1    1, −1
P     1, −1   0, 0   −1, 1
S    −1, 1    1, −1   0, 0

The first entry on the second row of this table says that, if the row player chooses P (for "paper") while the column player chooses R (for "rock"), then the row player gets a payoff of 1 while the column player gets a payoff of −1. The payoffs can in general be random variables, but we are only interested in their expected values, so we will not bother to write out any other properties of their distributions.

The payoff table is assumed to be common knowledge. That is, every player knows it, every player knows that every player knows it, and so forth. If the payoff table is not common knowledge (that is, if some players have information about it that others don't), the game is called a Bayesian game; section 13.4.1 covers Bayesian games in more detail.

The simplest type of game to reason about is a two-player constant-sum game. Constant sum means that, for each strategy profile (i.e., for each entry in the table), the sum of the payoffs to the two players is constant. For example, rock-paper-scissors is a constant-sum game, since each utility vector sums to zero. Constant-sum normal-form games are one of the few types of game with a universally accepted and easy-to-compute solution concept (the minimax equilibrium; see section 13.2.1). If the payoffs do not sum to a constant, or if there are more than two players, the game is called general sum. By convention a game with three or more players is always called general sum, even if the payoffs do sum to a constant, since minimax equilibrium doesn't make sense for multiplayer games.

Environments in which all players have the same payoffs are called cooperative or team games. Team games may appear easy, but they can be difficult to solve because of imperfect or incomplete information: while a player may believe that a particular strategy profile is best, it may not be able to trust the other players to agree. So, it may have to choose an action which appears suboptimal in order to try to reach a different but safer outcome.
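As a quick sketch, the payoff table can be stored as a pair of matrices (one per player), after which the constant-sum property is a mechanical check:

```python
import numpy as np

# Rock-paper-scissors: rows are the row player's actions (R, P, S), columns the
# column player's. A holds the row player's payoffs, B the column player's.
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])
B = -A

# The example from the text: profile (P, R) pays the row player 1, the column player -1.
assert A[1, 0] == 1 and B[1, 0] == -1

# Constant-sum check: A + B must be the same for every strategy profile.
assert np.all(A + B == (A + B)[0, 0])
```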

13.2 Equilibrium

common knowledge of rationality

types of equilibrium

safety value

An equilibrium of a game is a self-reinforcing distribution over strategy profiles. That is, if it is common knowledge that all players are acting according to a given equilibrium, no one player wants to change how it plays. There are many different types of equilibrium, which differ in how they formalize the above definition.

Classical game theory takes the view that the best way to analyze a game is to determine what its equilibria are. We can justify this view if we assume that the game is common knowledge among the players, and that the players have common knowledge of each other's rationality. However, equilibria don't tell us everything if the players have limited computation (often called bounded rationality) or if some players disagree about the rules of the game. Also, a single game may have many equilibria, and it may not be clear how the players should (or can) select one.

In this chapter we take a slightly different view: we are more concerned with how a player may adapt its actions based on information about what the other players are doing. So, we will write down learning algorithms and analyze what happens when the players use these algorithms in different types of games. Still, the various ideas of equilibrium are important: for example, under appropriate circumstances some of the learning algorithms we describe below will converge toward various types of equilibrium play.

We have already mentioned one type of equilibrium, the minimax or von Neumann equilibrium for constant-sum matrix games. For general-sum games, there are at least two important types of equilibrium: the Nash equilibrium and the correlated equilibrium. And we will see even more types of equilibrium when we discuss sequential decision making in section 13.4 below. In addition to the Nash and correlated equilibria, it is sometimes helpful to know the safety value of each player in the game.
A player's safety value is the best payoff that it can guarantee itself no matter what the other players do. That is, even if the other players irrationally ignore their own payoffs, they cannot force the first player to accept less than its safety value. In any equilibrium, each player must have a payoff at least as high as its safety value: if it did not, it would switch to its safety strategy.

13.2.1 Minimax Equilibrium

In a minimax equilibrium the players are required to choose independent probability distributions over their strategies, say x for the row player and y for the column player. If the payoff to the row player is r(x, y), then the minimax value of the game for the row player is

min_y max_x r(x, y)   (13.1)

and a minimax strategy for the row player is any value of x which achieves the maximum in 13.1. Neither player will wish to deviate from its set of minimax


strategies, since any deviation will give the other player a strategy which does strictly better than the minimax payoff. Any constant-sum matrix game has at least one minimax equilibrium. The set of these minimax equilibria is convex, all minimax equilibria have the same value, and min_y max_x r(x, y) = max_x min_y r(x, y). In the game of rock-paper-scissors described above, there is exactly one minimax equilibrium: both players play R, P, and S with equal probability. By taking the expectation over the different possible outcomes (each of the nine profiles RR, RP, RS, . . . has probability 1/9) we can see that the average payoff is zero for both players. On the other hand, if one of the players deviates from the equilibrium, the other player has a response that nets a better payoff: for example, if the row player picks R with probability 1/2 and P and S with probability 1/4 each, then the column player can choose P all the time. The column player will then get an expected payoff of (1/2)(1) + (1/4)(0) + (1/4)(−1) = 1/4 > 0.
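The payoff calculations in this example are easy to reproduce; in the sketch below, A is the row player's rock-paper-scissors payoff matrix and the column player's payoffs are −A.

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (rows and columns ordered R, P, S).
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])

# Against the uniform strategy, every pure reply earns the column player 0,
# so the uniform profile yields average payoff 0 and no profitable deviation.
uniform = np.full(3, 1.0 / 3.0)
assert np.allclose(uniform @ A, 0.0)

# The deviation from the text: row plays (1/2, 1/4, 1/4); column responds with P.
x = np.array([0.5, 0.25, 0.25])
col_payoffs = -(x @ A)          # column player's expected payoff for each pure reply
assert np.isclose(col_payoffs.max(), 0.25)
assert col_payoffs.argmax() == 1  # the best reply is P
```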

13.2.2 Nash Equilibrium

In general-sum games the idea of minimax equilibrium no longer makes sense, so we must seek alternative types of equilibrium. Perhaps the best-known type of equilibrium for general-sum games is the Nash equilibrium. In a Nash equilibrium we require the players to choose independent distributions over strategies. So, a Nash equilibrium is a profile of strategy distributions such that, if we hold the distributions fixed for every player except one, the remaining player can get no benefit by changing its play. There is always at least one Nash equilibrium for every game, and there may be many. In a constant-sum game, Nash equilibria are the same as minimax equilibria. But in a general-sum game, a player's payoff may differ greatly from one Nash equilibrium to another, and the set of Nash equilibria may be nonconvex and difficult to compute.

To illustrate Nash equilibria, consider the game of "Battle of the Sexes". In this game, a husband and wife want to decide whether to go to the opera (O) or the football game (F). One of them (the row player) prefers opera, while the other (the column player) prefers football. But they also prefer to be together; so, they have the following payoffs:

      O      F
O   4, 3   0, 0
F   0, 0   3, 4

This game has three Nash equilibria. Two of them are deterministic: both players go to the opera, or both go to football. The last one is mixed: the row player picks opera 4/7 of the time, while the column player picks opera 3/7 of the time.


(In this mixed strategy, each player's distribution makes the other player perfectly indifferent about whether to pick opera or football.)
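All three equilibria can be verified from the indifference conditions directly; in this sketch, A and B encode the payoff table above.

```python
import numpy as np

# Battle of the Sexes: A is the row player's payoff matrix, B the column player's.
A = np.array([[4, 0],
              [0, 3]])
B = np.array([[3, 0],
              [0, 4]])

# Mixed equilibrium: solve 3p = 4(1 - p) for the row player's opera probability p
# (making the column player indifferent) and 4q = 3(1 - q) for the column player's q.
p = 4.0 / 7.0
q = 3.0 / 7.0
row = np.array([p, 1 - p])
col = np.array([q, 1 - q])

# The column player's payoff is the same for O and F against `row` ...
assert np.allclose(row @ B, (row @ B)[0])
# ... and the row player's payoff is the same for O and F against `col`.
assert np.allclose(A @ col, (A @ col)[0])

# The two pure equilibria are mutual best responses.
assert A[0, 0] >= A[1, 0] and B[0, 0] >= B[0, 1]   # (O, O)
assert A[1, 1] >= A[0, 1] and B[1, 1] >= B[1, 0]   # (F, F)
```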

13.2.3 Correlated Equilibrium

In real life many people would solve the Battle of the Sexes by flipping a coin to decide which event to go to. This strategy is not a Nash equilibrium: in a Nash equilibrium the players are not allowed to communicate before the game, so they cannot both see the same coin flip. Instead it is a correlated equilibrium, which is like a Nash equilibrium except that we drop the requirement of independence between the players' distributions over strategies. More formally, consider a distribution P over the set of strategy profiles; P may contain arbitrary correlations between the strategies of the different players. Some external mechanism, which we will call the moderator, selects a strategy profile x according to P and reports to player i the action xi that it is supposed to follow. P is a correlated equilibrium if player i has no incentive to play anything other than xi (even after finding out that xi was recommended, which may tell it something about the other players' strategies).

The coin-flip strategy is a correlated equilibrium in which the distribution P places weight 1/2 on each of the profiles OO and FF. Another everyday example of a correlated equilibrium is a traffic light: we can model the light as being randomly red or green as we approach it.1 Red is the moderator's recommendation to stop, while green means to go through the intersection without stopping. Given a red light it is not worth going through the intersection and risking a crash, while with a green light we can assume that the traffic on the cross street will stop and our best strategy is to maintain speed.
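The coin-flip distribution can be checked against this definition mechanically (a minimal sketch; P puts probability 1/2 on each of OO and FF, and A, B are the Battle-of-the-Sexes payoff matrices):

```python
import numpy as np

A = np.array([[4, 0],      # row player's payoffs in Battle of the Sexes
              [0, 3]])
B = np.array([[3, 0],      # column player's payoffs
              [0, 4]])
P = np.array([[0.5, 0.0],  # moderator's distribution over profiles: rows O, F; cols O, F
              [0.0, 0.5]])

# Correlated equilibrium check for the row player: given each recommendation i,
# the conditional expected payoff of obeying must beat every deviation i2.
for i in range(2):
    cond = P[i] / P[i].sum()          # column's conditional distribution given i
    for i2 in range(2):
        assert cond @ A[i] >= cond @ A[i2]

# Symmetric check for the column player.
for j in range(2):
    cond = P[:, j] / P[:, j].sum()    # row's conditional distribution given j
    for j2 in range(2):
        assert cond @ B[:, j] >= cond @ B[:, j2]
```

Given the recommendation O, the row player knows the column player was also told O, so obeying earns 4 while deviating earns 0; the other three cases are symmetric.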

13.2.4 Equilibria in Battle of the Sexes

To gain more intuition for Nash and correlated equilibria we will illustrate how to compute them for the Battle of the Sexes. We will start by computing the correlated equilibria, which satisfy a set of linear equality and inequality constraints; we will then obtain the Nash equilibria by adding in some nonlinear constraints. We can describe a correlated equilibrium in Battle of the Sexes with numbers a, b, c, and d representing the probabilities of the four strategy profiles OO, OF, FO, and FF:

      O   F
O     a   b
F     c   d

Suppose that the row player receives the recommendation O. Then it knows that the column player will play O and F with probabilities a/(a + b) and b/(a + b), respectively. (The denominator is nonzero since the row player has received the recommendation O.) The definition


Equilibria in the Battle of the Sexes. The corners of the outlined simplex correspond to the four pure strategy profiles OO, OF, FO, and FF; the curved surface is the set of distributions where the row and column players' strategies are independent.

9/18/06

6:24 AM

Page 1

Simon Haykin is University Professor and Director of the Adaptive Systems Laboratory at McMaster University. José C. Príncipe is Distinguished Professor of Electrical and Biomedical Engineering at the University of Florida, Gainesville, where he is BellSouth Professor and Founder and Director of the Computational NeuroEngineering Laboratory. Terrence J. Sejnowski is Francis Crick Professor, Director of the Computational Neurobiology Laboratory, and a Howard Hughes Medical Institute Investigator at the Salk Institute for Biological Studies and Professor of Biology at the University of California, San Diego. John McWhirter is Senior Fellow at QinetiQ Ltd., Malvern, Associate Professor at the Cardiff School of Engineering, and Honorary Visiting Professor at Queen’s University, Belfast.

OF RELATED INTEREST

Probabilistic Models of the Brain PERCEPTION AND NEURAL FUNCTION

edited by Rajesh P. N. Rao, Bruno A. Olshausen, and Michael S. Lewicki The topics covered include Bayesian and information-theoretic models of perception, probabilistic theories of neural coding and spike timing, computational models of lateral and cortico-cortical feedback connections, and the development of receptive field properties from natural signals. Theoretical Neuroscience COMPUTATIONAL AND MATHEMATICAL MODELING OF NEURAL SYSTEMS

Peter Dayan and L. F. Abbott Theoretical neuroscience provides a quantitative basis for describing what nervous systems do, determining how they function, and uncovering the general principles by which they operate. This text introduces the basic mathematical and computational methods of theoretical neuroscience and presents applications in a variety of areas including vision, sensory-motor integration, development, learning, and memory.

The MIT Press Massachusetts Institute of Technology Cambridge, Massachusetts 02142 http://mitpress.mit.edu 0-262-08348-5 978-0-262-08348-5

Neural Information Processing series

New Directions in Statistical Signal Processing Haykin, Príncipe, Sejnowski, and McWhirter, editors

COMPUTER SCIENCE /COMPUTATIONAL NEUROSCIENCE /STATISTICS

New Directions in Statistical Signal Processing FROM SYSTEMS TO BRAINS

New Directions in Statistical Signal Processing FROM SYSTEMS TO BRAINS edited by Simon Haykin, José C. Príncipe, Terrence J. Sejnowski, and John McWhirter

edited by Simon Haykin, José C. Príncipe, Terrence J. Sejnowski, and John McWhirter

Signal processing and neural computation have separately and significantly influenced many disciplines, but the cross-fertilization of the two fields has begun only recently. Research now shows that each has much to teach the other, as we see highly sophisticated kinds of signal processing and elaborate hierarchical levels of neural computation performed side by side in the brain. In New Directions in Statistical Signal Processing, leading researchers from both signal processing and neural computation present new work that aims to promote interaction between the two disciplines. The book’s 14 chapters, almost evenly divided between signal processing and neural computation, begin with the brain and move on to communication, signal processing, and learning systems. They examine such topics as how computational models help us understand the brain’s information processing, how an intelligent machine could solve the “cocktail party problem” with “active audition” in a noisy environment, graphical and network structure modeling approaches, uncertainty in network communications, the geometric approach to blind signal processing, game-theoretic learning algorithms, and observable operator models (OOMs) as an alternative to hidden Markov models (HMMs).

New Directions in Statistical Signal Processing

Neural Information Processing Series Michael I. Jordan and Thomas Dietterich, editors Advances in Large Margin Classiﬁers Alexander J. Smola, Peter L. Bartlett, Bernhard Sch¨ olkopf, and Dale Schuurmans, eds., 2000 Advanced Mean Field Methods: Theory and Practice Manfred Opper and David Saad, eds., 2001 Probabilistic Models of the Brain: Perception and Neural Function Rajesh P. N. Rao, Bruno A. Olshausen, and Michael S. Lewicki, eds., 2002 Exploratory Analysis and Data Modeling in Functional Neuroimaging Friedrich T. Sommer and Andrzej Wichert, eds., 2003 Advances in Minimum Description Length: Theory and Applications Peter D. Grunwald, In Jae Myung, and Mark A. Pitt, eds., 2005 New Directions in Statistical Signal Processing: From Systems to Brain Simon Haykin, Jos´e C. Pr´ıncipe, Terrence J. Sejnowski, and John McWhirter, eds., 2006 Nearest-Neighbor Methods in Learning and Vision: Theory and Practice Gregory Shakhnarovich, Piotr Indyk, and Trevor Darrell, eds., 2006 New Directions in Statistical Signal Processing: From Systems to Brain Simon Haykin, Jos´e C. Pr´ıncipe, Terrence J. Sejnowski, and John McWhirter, eds., 2007

New Directions in Statistical Signal Processing: From Systems to Brain

edited by Simon Haykin Jos´e C. Pr´ıncipe Terrence J. Sejnowski John McWhirter

The MIT Press Cambridge, Massachusetts London, England

c 2007 Massachusetts Institute of Technology All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. Printed and bound in the United States of America Library of Congress Cataloging-in-Publication Data New directions in statistical signal processing; from systems to brain / edited by Simon Haykin ... [et al.]. p. cm. (Neural information processing series) Includes bibliographical references and index. ISBN10: 0-262-08348-5 (alk. paper) ISBN13: 978-0-262-08348-5 1. Neural networks (Neurobiology) 2. Neural networks (Computer science) 3. Signal processing—Statistical methods. 4. Neural computers. I. Haykin, Simon S., 1931 – II. Series. QP363.3.N52 2006 612.8’2—dc22 2005056210 10 9 8 7 6 5 4 3 2 1

Contents

Series Foreword vii

Preface ix

1 Modeling the Mind: From Circuits to Systems 1
Suzanna Becker

2 Empirical Statistics and Stochastic Models for Visual Signals 23
David Mumford

3 The Machine Cocktail Party Problem 51
Simon Haykin and Zhe Chen

4 Sensor Adaptive Signal Processing of Biological Nanotubes (Ion Channels) at Macroscopic and Nano Scales 77
Vikram Krishnamurthy

5 Spin Diffusion: A New Perspective in Magnetic Resonance Imaging 119
Timothy R. Field

6 What Makes a Dynamical System Computationally Powerful? 127
Robert Legenstein and Wolfgang Maass

7 A Variational Principle for Graphical Models 155
Martin J. Wainwright and Michael I. Jordan

8 Modeling Large Dynamical Systems with Dynamical Consistent Neural Networks 203
Hans-Georg Zimmermann, Ralph Grothmann, Anton Maximilian Schäfer, and Christoph Tietz

9 Diversity in Communication: From Source Coding to Wireless Networks 243
Suhas N. Diggavi

10 Designing Patterns for Easy Recognition: Information Transmission with Low-Density Parity-Check Codes 287
Frank R. Kschischang and Masoud Ardakani

11 Turbo Processing 307
Claude Berrou, Charlotte Langlais, and Fabrice Seguin

12 Blind Signal Processing Based on Data Geometric Properties 337
Konstantinos Diamantaras

13 Game-Theoretic Learning 379
Geoffrey J. Gordon

14 Learning Observable Operator Models via the Efficient Sharpening Algorithm 417
Herbert Jaeger, Mingjie Zhao, Klaus Kretzschmar, Tobias Oberstein, Dan Popovici, and Andreas Kolling

References 465

Contributors 509

Index 513

Series Foreword

The yearly Neural Information Processing Systems (NIPS) workshops bring together scientists with broadly varying backgrounds in statistics, mathematics, computer science, physics, electrical engineering, neuroscience, and cognitive science, unified by a common desire to develop novel computational and statistical strategies for information processing, and to understand the mechanisms for information processing in the brain. As opposed to conferences, these workshops maintain a flexible format that both allows and encourages the presentation and discussion of work in progress, and thus serves as an incubator for the development of important new ideas in this rapidly evolving field. The series editors, in consultation with workshop organizers and members of the NIPS Foundation Board, select specific workshop topics on the basis of scientific excellence, intellectual breadth, and technical impact. Collections of papers chosen and edited by the organizers of specific workshops are built around pedagogical introductory chapters, while research monographs provide comprehensive descriptions of workshop-related topics, to create a series of books that provides a timely, authoritative account of the latest developments in the exciting field of neural computation.

Michael I. Jordan
Thomas Dietterich

Preface

In the course of some 60 to 65 years, going back to the 1940s, signal processing and neural computation have evolved into two highly pervasive disciplines. In their own individual ways, they have significantly influenced many other disciplines. What is perhaps surprising to see, however, is the fact that the cross-fertilization between signal processing and neural computation is still very much in its infancy. We only need to look at the brain and be amazed by the highly sophisticated kinds of signal processing and elaborate hierarchical levels of neural computation, which are performed side by side and with relative ease. If there is one important lesson that the brain teaches us, it is summed up here: There is much that signal processing can learn from neural computation, and vice versa.

It is with this aim in mind that in October 2003 we organized a one-week workshop on "Statistical Signal Processing: New Directions in the Twentieth Century," which was held at the Fairmont Lake Louise Hotel, Lake Louise, Alberta. To fulfill that aim, we invited some leading researchers from around the world in the two disciplines, signal processing and neural computation, in order to encourage interaction and cross-fertilization between them. Needless to say, the workshop was highly successful.

One of the most satisfying outcomes of the Lake Louise Workshop is that it has led to the writing of this new book. The book consists of 14 chapters, divided almost equally between signal processing and neural computation. To emphasize, in some sense, the spirit of the above-mentioned lesson, the book is entitled New Directions in Statistical Signal Processing: From Systems to Brain. It is our sincere hope that in some measurable way, the book will prove helpful in realizing the original aim that we set out for the Lake Louise Workshop.

Finally, we wish to thank Dr. Zhe Chen, who spent tremendous effort and time on LaTeX editing and proofreading during the preparation and final production of the book.

Simon Haykin
José C. Príncipe
Terrence J. Sejnowski
John McWhirter

1

Modeling the Mind: From Circuits to Systems

Suzanna Becker

Computational models are having an increasing impact on neuroscience, by shedding light on the neuronal mechanisms underlying information processing in the brain. In this chapter, we review the contribution of computational models to our understanding of how the brain represents and processes information at three broad levels: (1) sensory coding and perceptual processing, (2) high-level memory systems, and (3) representations that guide actions. So far, computational models have had the greatest impact at the earliest stages of information processing, by modeling the brain as a communication channel and applying concepts from information theory. Generally, these models assume that the goal of sensory coding is to map the high-dimensional sensory signal into a (usually lower-dimensional) code that is optimal with respect to some measure of information transmission. Four information-theoretic coding principles will be considered here, each of which can be used to derive unsupervised learning rules, and has been applied to model multiple levels of cortical organization.

Moving beyond perceptual processing to high-level memory processes, the hippocampal system in the medial temporal lobe (MTL) is a key structure for representing complex configurations or episodes in long-term memory. In the hippocampal region, the brain may use very different optimization principles aimed at the memorization of complex events or spatiotemporal episodes, and subsequent reconstruction of details of these episodic memories. Here, rather than recoding the incoming signals in a way that abstracts away unnecessary details, the goal is to memorize the incoming signal as accurately as possible in a single learning trial. Most efforts at understanding hippocampal function through computational modeling have focused on sub-regions within the hippocampal circuit such as the CA3 or CA1 regions, using "off-the-shelf" learning algorithms such as competitive learning or Hebbian pattern association.
More recently, Becker proposed a global optimization principle for learning within this brain region. Based on the goal of accurate input reconstruction, combined with neuroanatomical constraints, this leads to simple, biologically plausible learning rules for all regions within the hippocampal circuit. The model exhibits the key features of an episodic memory


system: the capacity to store a large number of distinct, complex episodes, to recall a complete episode from a minimal cue, and to associate items across time, under extremely high plasticity conditions. Finally, moving beyond the static representation of information, we must consider the brain not simply as a passive recipient of information, but as a complex, dynamical system, with internal goals and the ability to select actions based on environmental feedback. Ultimately, models based on the broad goals of prediction and control, using reinforcement-driven learning algorithms, may be the best candidates for characterizing the representations that guide motor actions. Several examples of models are described that begin to address the problem of how we learn representations that can guide our actions in a complex environment.

1.1 Introduction

How does the brain process, represent, and act on sensory signals? Through the use of computational models, we are beginning to understand how neural circuits perform these remarkably complex information-processing tasks. Psychological and neurobiological studies have identified at least three distinct long-term memory systems in the brain: (1) the perceptual/semantic memory system in the neocortex learns gradually to represent the salient features of the environment; (2) the episodic memory system in the medial temporal lobe learns rapidly to encode complex events, rich in detail, characterizing a particular episode in a particular place and time; (3) the procedural memory system, encompassing numerous cortical and subcortical structures, learns sensory-motor mappings. In this chapter, we consider several major developments in computational modeling that shed light on how the brain learns to represent information at three broad levels, reflecting these three forms of memory: (1) sensory coding, (2) episodic memory, and (3) representations that guide actions. Rather than providing a comprehensive review of all models in these areas, our goal is to highlight some of the key developments in the field, and to point to the most promising directions for future work.

1.2 Sensory Coding

At the earliest stages of sensory processing in the cortex, quite a lot is known about the neural coding of information, from Hubel and Wiesel's classic findings of orientation-selective neurons in primary visual cortex (Hubel and Wiesel, 1968) to more recent studies of spatiotemporal receptive fields in visual cortex (DeAngelis et al., 1993) and spectrotemporal receptive fields in auditory cortex (Calhoun and Schreiner, 1998; Kowalski et al., 1996). Given the abundance of electrophysiological data to constrain the development of computational models, it is not surprising that most models of learning and memory have focused on the early stages of sensory coding. One approach to modeling sensory coding is to hand-design filters, such as the Gabor or difference-of-Gaussians filter, so as to match experimentally observed receptive fields. However, this approach has limited applicability beyond the very earliest stages of sensory processing for which receptive fields have been reasonably well mapped out. A more promising approach is to try to understand the developmental processes that generated the observed data. Note that these could include both learning and evolutionary factors, but here our focus is restricted to potential learning mechanisms. The goal is then to discover the general underlying principles that cause sensory systems to self-organize their receptive fields. Once these principles have been uncovered, they can be used to derive models of learning. One can then simulate the developmental process by exposing the model to typical sensory input and comparing the results to experimental observations. More important, one can simulate neuronal functions that might not have been conceived by experimentalists, and thereby generate novel experimental predictions.

Several classes of computational models have been influential in guiding current thinking about self-organization in sensory systems. These models share the general feature of modeling the brain as a communication channel and applying concepts from information theory. The underlying assumption of these models is that the goal of sensory coding is to map the high-dimensional sensory signal into another (usually lower-dimensional) code that is somehow optimal with respect to information content. Four information-theoretic coding principles will be considered here: (1) Linsker's Infomax principle, (2) Barlow's redundancy reduction principle, (3) Becker and Hinton's Imax principle, and (4) Rissanen's minimum description length (MDL) principle.
Each of these principles has been used to derive models of learning and has inspired further research into related models at multiple stages of information processing.

1.2.1 Linsker's Infomax Principle

How should neurons respond to the sensory signal, given that it is noisy, high-dimensional, and highly redundant? Is there a more convenient form in which to encode signals so that we can make more sense of the relevant information and take appropriate actions? In the human visual system, for example, there are hundreds of millions of photoreceptors converging onto about two million optic nerve fibers. By what principle does the brain decide what information to discard and what to preserve? Linsker proposed a model of self-organization in sensory systems based on the Infomax principle: each neuron adjusts its connection strengths or weights so as to maximize the amount of Shannon information in the neural code that is conveyed about the sensory input (Linsker, 1988). In other words, the Infomax principle dictates that neurons should maximize the amount of mutual information between their input x and output y:

I_{x;y} = \left\langle \ln \frac{p(x \mid y)}{p(x)} \right\rangle


Figure 1.1 Linsker's multilayer architecture for learning center-surround and oriented receptive fields, with environmental input entering at the lowest layer. Higher layers learned progressively more "Mexican-hat-like" receptive fields. The inputs consisted of uncorrelated noise, and in each layer, center-surround receptive fields evolved with progressively greater contrast between center and surround.

Assuming that the input consists of a multidimensional Gaussian signal with additive, independent Gaussian noise with variance V(n), for a single neuron whose output y is a linear function of its inputs and connection weights w, the mutual information is the log of the signal-to-noise ratio:

I_{x;y} = \frac{1}{2} \ln \frac{V(y)}{V(n)}
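This linear-Gaussian objective is easy to probe numerically. The sketch below is our own toy construction (not Linsker's simulation): with the weight norm fixed, the information measure is maximized by aligning the weight vector with the principal eigenvector of the input covariance, which is the direction a normalized Hebbian rule (e.g. Oja's rule) approaches.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy linear-Gaussian Infomax: y = w.x + noise, I = 0.5 * ln(V(y)/V(n)).
d, n_samples = 4, 20000
C = np.diag([4.0, 2.0, 1.0, 0.5])              # input covariance (diagonal for clarity)
X = rng.multivariate_normal(np.zeros(d), C, size=n_samples)
noise_var = 0.1

def infomax_nats(w):
    """0.5 * ln(V(y)/V(n)) for the linear unit y = w.x + noise."""
    y = X @ w + rng.normal(0.0, noise_var ** 0.5, size=n_samples)
    return 0.5 * np.log(np.var(y) / noise_var)

w_principal = np.array([1.0, 0.0, 0.0, 0.0])   # top eigenvector of C
w_random = rng.normal(size=d)
w_random /= np.linalg.norm(w_random)           # same norm as w_principal

i_principal = infomax_nats(w_principal)
i_random = infomax_nats(w_random)
```

For this covariance, the principal direction carries variance 4.0 and yields roughly 0.5 ln(4.1/0.1) ≈ 1.9 nats, more than any other unit-norm weight vector.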

Linsker showed that a simple, Hebb-like weight update rule approximately maximizes this information measure. The center-surround receptive field (with either an on-center and off-surround or off-center and on-surround spatial pattern of connection strengths) is characteristic of neurons in the earliest stages of the visual pathways including the retina and lateral geniculate nucleus (LGN) of the thalamus. Surprisingly, Linsker's simulations using purely uncorrelated random inputs, and a multilayer circuit as shown in fig 1.1, showed that neurons in successive layers developed progressively more "Mexican-hat" shaped receptive fields (Linsker, 1986a,b,c), reminiscent of the center-surround receptive fields seen in the visual system.

In further developments of the model, using a two-dimensional sheet of neurons with local-neighbor lateral connections, Linsker (1989) showed that the model self-organized topographic maps with oriented receptive fields, such that nearby units on the map developed similarly oriented receptive fields. This organization is a good first approximation to that of the primary visual cortex.

The Infomax principle has been highly influential in the study of neural coding, going well beyond Linsker's pioneering work in the linear case. One of the major developments in this field is Bell and Sejnowski's Infomax-based independent components analysis (ICA) algorithm, which applies to nonlinear mappings with equal numbers of inputs and outputs (Bell and Sejnowski, 1995). Bell and Sejnowski


showed that when the mapping from inputs to outputs is continuous, nonlinear, and invertible, maximizing the mutual information between inputs and outputs is equivalent to simply maximizing the entropy of the output signal. The algorithm therefore performs a form of ICA. Infomax-based ICA has also been used to model receptive fields in visual cortex. When applied to natural images, in contrast to principal component analysis (PCA), Infomax-based ICA develops oriented receptive fields at a variety of spatial scales that are sparse, spatially localized, and reminiscent of oriented receptive fields in primary visual cortex (Bell and Sejnowski, 1997). Another variant of nonlinear Infomax developed by Okajima and colleagues (Okajima, 2004) has also been applied to modeling higher levels of visual processing, including combined binocular disparity and spatial frequency analysis.
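A minimal sketch of Infomax ICA in the square, invertible setting is shown below, using the widely used natural-gradient form of the update with a logistic output nonlinearity; the two-source mixing problem is our own toy example, not one from the original paper.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 20000
S = rng.laplace(size=(2, n))                   # independent, super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])         # unknown mixing matrix
X = A @ S                                      # observed mixtures

W = np.eye(2)                                  # unmixing matrix to be learned
lr = 0.1
for _ in range(500):
    U = W @ X
    # For a logistic nonlinearity, 1 - 2*sigmoid(u) = -tanh(u/2), giving
    # the natural-gradient Infomax update dW = lr*(I - tanh(U/2) U^T/n) W.
    W += lr * (np.eye(2) - np.tanh(U / 2) @ U.T / n) @ W

P = W @ A   # approaches a scaled permutation once the sources are recovered
```

When separation succeeds, each row of `P` has one dominant entry, reflecting the usual scale and permutation ambiguity of ICA.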

1.2.2 Barlow's Redundancy Reduction Principle

The principle of preserving information may be a good description of the very earliest stages of sensory coding, but it is unlikely that this one principle will capture all levels of processing in the brain. Clearly, one can trivially preserve all the information in the input simply by copying the input to the next level up. Thus, the idea only makes sense in the context of additional processing constraints. Implicit in Linsker's work was the constraint of dimension reduction. However, in the neocortex, there is no evidence of a progressive reduction in the number of neurons at successively higher levels of processing. Barlow proposed a slightly different principle of self-organization based on the idea of producing a minimally redundant code. The information about an underlying signal of interest (such as the visual form or the sound of a predator) may be distributed across many input channels. This makes it difficult to associate particular stimulus values with distinct responses. Moreover, there is a high degree of redundancy across different channels. Thus, a neural code having minimal redundancy should make it easier to associate different stimulus values with different responses. The formal, information-theoretic definition of redundancy is the information content of the stimulus, less the capacity of the channel used to convey the information. Unfortunately, quantities dependent upon calculation of entropy are difficult to compute. Thus, several different formulations of Barlow's principle have been proposed, under varying assumptions and approximations. One simple way for a learning algorithm to lower redundancy is to reduce correlations among the outputs (Barlow and Földiák, 1989). This can remove second-order but not higher-order dependencies. Atick and Redlich proposed minimizing the following measure of redundancy (Atick and Redlich, 1990):

R = 1 - \frac{I_{y;s}}{C_{out}(y)}
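The second-order version of redundancy reduction, removing correlations among the outputs, can be sketched with a whitening transform; the mixing matrix and recipe below are our own illustration, not a model from the chapter.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two redundant (correlated) input channels built from a linear mixture.
n = 10000
z = rng.normal(size=(2, n))
mix = np.array([[2.0, 1.0], [1.0, 1.5]])
x = mix @ z                                    # correlated channels

C = np.cov(x)                                  # strong off-diagonal covariance
evals, evecs = np.linalg.eigh(C)
W_white = np.diag(evals ** -0.5) @ evecs.T     # whitening (decorrelating) transform
y = W_white @ x

C_y = np.cov(y)                                # identity: outputs are decorrelated
```

This removes all second-order dependencies exactly, but, as the text notes, it leaves higher-order dependencies untouched, which is what the nonlinear ICA approaches address.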


Figure 1.2 Atick and Redlich's learning principle was to minimize redundancy in the output, y, while preserving information about the input, x. The signal s enters the input channel as x = s + v1 and is recoded as y = Ax + v2.

subject to the constraint of zero information loss (fixed I_{y;s}). C_{out}(y), the output channel capacity, is defined to be the maximum of I_{y;s}. The channel capacity is at a maximum when the covariance matrix of the output elements is diagonal, hence Atick and Redlich used

C_{out}(y) = \frac{1}{2} \sum_i \ln \frac{(R_{yy})_{ii}}{N_v^2}

Thus, under this formulation, minimizing redundancy amounts to minimizing the channel capacity. This model is depicted in fig 1.2. This model was used to simulate retinal receptive fields. Under conditions of high noise (low redundancy), the receptive fields that emerged were Gaussian-shaped spatial smoothing filters, while at low noise levels (high redundancy) on-center off-surround receptive fields resembling second spatial derivative filters emerged. In fact, cells in the mammalian retina and lateral geniculate nucleus of the thalamus dynamically adjust their filtering characteristics as light levels fluctuate between these two extremes under conditions of low versus high contrast (Shapley and Victor, 1979; Virsu et al., 1977). Moreover, this strategy of adaptive rescaling of neural responses has been shown to be optimal with respect to information transmission (Brenner et al., 2000).

Similar learning principles have been applied by Atick and colleagues to model higher stages of visual processing. Dong and Atick modeled redundancy reduction across time, in a model of visual neurons in the lateral geniculate nucleus of the thalamus (Dong and Atick, 1995). In their model, neurons with both lagged and nonlagged spatiotemporal smoothing filters emerged. These receptive fields would be useful for conveying information about stimulus onsets and offsets. Li and Atick (1994) modeled redundancy reduction across binocular visual inputs. Their model generated binocular, oriented receptive fields at a variety of spatial scales, similar to those seen in primary visual cortex. Bell and
Bell and Sejnowski’s Infomax-based ICA algorithm (Bell and Sejnowski, 1995) is also closely related to Barlow’s minimal redundancy principal, since the ICA model is restricted to invertible mappings; in


Figure 1.3 Becker and Hinton's Imax learning principle maximizes the mutual information I(a;b) between features a = f1(s + n1) and b = f2(s + n2) extracted from different input channels X1 and X2.

this case, the maximization of mutual information amounts to reducing statistical dependencies among the outputs.

1.2.3 Becker and Hinton's Imax Principle

The goal of retaining as much information as possible may be a good description of early sensory coding. However, the brain seems to do much more than simply preserve information and recode it into a more convenient form. Our perceptual systems are exquisitely tuned to certain regularities in the world, and consequently to irregularities which violate our expectations. The things which capture our attention and thus motivate us to learn and act are those which violate our expectations about the coherence of the world—the sudden onset of a sound, the appearance of a looming object or a predator. In order to be sensitive to changes in our environment, we require internal representations which ﬁrst capture the regularities in our environment. Even relatively low-order regularities, such as the spatial and temporal coherence of sensory signals, convey important cues for extracting very high level properties about objects. For example, the coherence of the visual signal across time and space allows us to segregate the parts of a moving object from its surrounding background, while the coherence of auditory events across frequency and time permits the segregation of the auditory input into its multiple distinct sources. Becker and Hinton (1992) proposed the Imax principle for unsupervised learning, which dictates that signals of interest should have high mutual information across diﬀerent sensory channels. In the simplest case, illustrated in ﬁg 1.3, there are two input sources, x1 and x2 , conveying information about a common underlying Gaussian signal of interest, s, and each channel is corrupted by independent, additive Gaussian noise: x1 = s + n1 , x2 = s + n2 . However, the input may be high dimensional and may require a nonlinear transformation in order to extract the signal. Thus the goal of the learning is to transform the two input signals into outputs, y1 and y2 , having maximal mutual


Figure 1.4 Imax architecture used to learn stereo features from binary images: outputs a and b, computed from the left and right image strips, are trained to maximize I(a;b).

information. Because the signal is Gaussian and the noise terms are assumed to be identically distributed, the information in common to the two outputs can be maximized by maximizing the following log signal-to-noise ratio (SNR):

I_{y_1;y_2} \approx \log \frac{V(y_1 + y_2)}{V(y_1 - y_2)}

This could be accomplished by multiple stages of processing in a nonlinear neural circuit like the one shown in fig 1.4. Becker and Hinton (1992) showed that this model could extract binocular disparity from random dot stereograms, using the architecture shown in fig 1.4. Note that this function requires multiple stages of processing through a network of nonlinear neurons with sigmoidal activation functions. The Imax algorithm has been used to learn temporally coherent features (Becker, 1996; Stone, 1996), and extended to learn multidimensional features (Zemel and Hinton, 1991). A very similar algorithm for binary units was developed by Kay and colleagues (Kay, 1992; Phillips et al., 1998). The minimizing disagreement algorithm (de Sa, 1994) is a probabilistic learning procedure based on principles of Bayesian classification, but is nonetheless very similar to Imax in its objective to extract classes that are coherent across multiple sensory input channels.
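The SNR form of the Imax objective can be sketched numerically. In this toy setup of ours (a shared Gaussian signal embedded in two noisy channels, with linear units), weights matched to the common signal score far higher on the objective than random weights, because the signal cancels in the difference term y1 - y2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared Gaussian signal s drives two 5-D input channels with independent noise.
n, d = 5000, 5
s = rng.normal(size=n)
proj = rng.normal(size=d)                      # how s enters each channel
x1 = np.outer(s, proj) + 0.5 * rng.normal(size=(n, d))
x2 = np.outer(s, proj) + 0.5 * rng.normal(size=(n, d))

def imax(w1, w2):
    """Imax proxy: 0.5 * log V(y1 + y2) / V(y1 - y2) for linear units."""
    y1, y2 = x1 @ w1, x2 @ w2
    return 0.5 * np.log(np.var(y1 + y2) / np.var(y1 - y2))

i_aligned = imax(proj, proj)                   # weights matched to the shared signal
i_random = imax(rng.normal(size=d), rng.normal(size=d))
```

Gradient ascent on this quantity is what drives the weight updates in the models described above; here we only evaluate the objective.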

1.2.4 Rissanen's Minimum Description Length Principle

The overall goal of every unsupervised learning algorithm is to discover the important underlying structure in the data. Learning algorithms based on Shannon information have the drawback of requiring knowledge of the probability distribution of the data, and/or of the extracted features, and hence tend either to be very computationally expensive or to make highly simplifying assumptions about the distributions (e.g. binary or Gaussian variables). An alternative approach is to develop a model of the data that is somehow optimal with respect to coding efficiency. The minimum description length (MDL) principle, first introduced by


Figure 1.5 The minimum description length (MDL) principle: data d is encoded by model M into a code c, and reconstructed as d′ by the inverse model M⁻¹.

Rissanen (1978), favors models that provide accurate encoding of the data using as simple a model as possible. The rationale behind the MDL principle is that the criterion of discovering statistical regularities in data can be quantified by the length of the code generated to describe the data. A large number of learning algorithms have been developed based on the MDL principle, but only a few of these have attempted to provide plausible accounts of neural processing. One such example was developed by Zemel and Hinton (1995), who cast the autoencoder problem within an MDL framework. They proposed that the goal of learning should be to minimize the total cost of communicating the input data, which depends on three terms: the length of the code, c; the cost of communicating the inverse model, M⁻¹ (the coding cost of communicating how to reconstruct the data); and the reconstruction error:

Cost = Length(c) + Length(M⁻¹) + Length(|d − d′|)

as illustrated in fig 1.5. They instantiated these ideas using an autoencoder architecture, with hidden units whose activations were Gaussian functions of the inputs. Under a Gaussian model of the input activations, it was assumed that the hidden unit activations, as a population, encode a point in a lower-dimensional implicit representational space. For example, a population of place cells in the hippocampus might receive very high dimensional multisensory input, and map this input onto a population of neural activations which codes implicitly the animal's spatial location—a point in a two-dimensional Cartesian space. The population response could be decoded by averaging together the implicit coordinates of the hidden units, weighted by their activations. Zemel and Hinton's cost function incorporated a reconstruction term and a coding cost term that measured the fit of the hidden unit activations to a Gaussian model of implicit coordinates.
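The trade-off behind this cost can be sketched with a deliberately simple two-part code, entirely of our own construction: describing 1-D data either with no model, or by first transmitting a model (a mean) and then the residuals, using the idealized Gaussian codelength -log2 N(x; mu, 1).

```python
import numpy as np

rng = np.random.default_rng(6)

def codelength_bits(x, mu):
    """Idealized codelength -log2 N(x; mu, 1), summed over the data."""
    return np.sum(0.5 * np.log2(2 * np.pi) + (x - mu) ** 2 / (2 * np.log(2)))

x = rng.normal(10.0, 1.0, size=200)            # data centered far from 0

model_bits = 32.0                              # assumed cost of transmitting the mean
no_model = codelength_bits(x, 0.0)             # code the raw data around 0
with_model = model_bits + codelength_bits(x, x.mean())
```

Paying a few bits for the model shortens the residual code enormously, which is exactly the accounting the Zemel-Hinton cost performs over code, model, and reconstruction error.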
The weights of the hidden units and the coordinates in implicit space were jointly optimized with respect to this MDL cost.

Algorithms which perform clustering, when cast within a statistical framework, can also be viewed as a form of MDL learning. Nowlan derived such an algorithm, called maximum likelihood competitive learning (MLCL), for training neural networks using the expectation-maximization (EM) algorithm (Jacobs et al., 1991; Nowlan, 1990). In this framework, the network is viewed as a probabilistic, generative model of the data. The learning serves to adjust the weights so as to maximize the log likelihood of the model having generated the data: L = log P(data | model). If the training patterns, I^{(α)}, are independent,

L = \log \prod_{\alpha=1}^{n} P(I^{(\alpha)} \mid \mathrm{model}) = \sum_{\alpha=1}^{n} \log P(I^{(\alpha)} \mid \mathrm{model}).

The MLCL algorithm applies this objective function to the case where the units have Gaussian activations and form a mixture model of the data:

L = \sum_{\alpha=1}^{n} \log \sum_{i=1}^{m} P(I^{(\alpha)} \mid \mathrm{submodel}_i) \, P(\mathrm{submodel}_i) = \sum_{\alpha=1}^{n} \log \sum_{i=1}^{m} y_i^{(\alpha)} \pi_i,

where the π_i's are positive mixing coefficients that sum to one, and the y_i's are the unit activations:

y_i^{(\alpha)} = \mathcal{N}(I^{(\alpha)}; \mathbf{w}_i, \Sigma_i),

where \mathcal{N}(\cdot) is the Gaussian density function, with mean \mathbf{w}_i and covariance matrix \Sigma_i.

The MLCL model makes the assumption that every pattern is independent of every other pattern. However, this assumption of independence is not valid under natural viewing conditions. If one view of an object is encountered, a similar view of the same object is likely to be encountered next. Hence, one powerful cue for real vision systems is the temporal continuity of objects. Novel objects typically are encountered from a variety of angles, as the position and orientation of the observer, or objects, or both, vary smoothly over time. Given the importance of temporal context as a cue for feature grouping and invariant object recognition, it is very likely that the brain makes use of this property of the world in perceptual learning. Becker (1999) proposed an extension to MLCL that incorporates context into the learning. Relaxing the assumption that the patterns are independent, and allowing for temporal dependencies among the input patterns, the log likelihood function becomes:

L = \log P(\mathrm{data} \mid \mathrm{model}) = \sum_{\alpha} \log P(I^{(\alpha)} \mid I^{(1)}, \ldots, I^{(\alpha-1)}, \mathrm{model}).
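The mixture log-likelihood that MLCL maximizes can be evaluated directly; the toy data and spherical-covariance simplification below are our own, chosen only to show that unit means placed on the clusters score far higher than unit means that ignore the data's structure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two well-separated 2-D clusters.
X = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),
               rng.normal(+2.0, 0.5, size=(100, 2))])

def mlcl_log_likelihood(X, means, var, mix):
    """L = sum_a log sum_i pi_i * N(x_a; w_i, var*I), spherical Gaussians."""
    d = X.shape[1]
    ll = 0.0
    for x in X:
        dens = sum(pi * np.exp(-np.sum((x - m) ** 2) / (2 * var))
                   / (2 * np.pi * var) ** (d / 2)
                   for m, pi in zip(means, mix))
        ll += np.log(dens)
    return ll

good = mlcl_log_likelihood(X, [np.array([-2.0, -2.0]), np.array([2.0, 2.0])],
                           0.25, [0.5, 0.5])
bad = mlcl_log_likelihood(X, [np.array([0.0, 0.0]), np.array([0.0, 0.0])],
                          0.25, [0.5, 0.5])
```

EM training of MLCL amounts to hill-climbing on this quantity, with each unit's responsibility for a pattern playing the role of the competitive activation.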


Figure 1.6 Contextually modulated competitive learning. The architecture comprises input units I_m, clustering units y_k, gating units g_j, and a secondary (contextual) input source c_i that provides a modulatory signal.

To incorporate a contextual information source into the learning equation, a contextual input stream was introduced into the likelihood function:

L = \log P(\mathrm{data} \mid \mathrm{model}, \mathrm{context}) = \sum_{\alpha} \log P(I^{(\alpha)} \mid I^{(1)}, \ldots, I^{(\alpha-1)}, \mathrm{model}, \mathrm{context}),

as depicted in fig 1.6. This model was trained on a series of continuously rotating images of faces, and learned a representation that categorized people's faces according to identity, independent of viewpoint, by taking advantage of the temporal continuity in the image sequences.

Many models of population encoding apply to relatively simple, one-layer feedforward architectures. However, the structure of neocortex is much more complex. There are multiple cortical regions, and extensive feedback connections both within and between regions. Taking these features of neocortex into account, Hinton has developed a series of models based on the Boltzmann machine (Ackley et al., 1985), and the more recent Helmholtz machine (Dayan et al., 1995) and Product of Experts (PoE) model (Hinton, 2000; Hinton and Brown, 2000). The common idea underlying these models is to try to find a population code that forms a causal model of the underlying data. The Boltzmann machine was unacceptably slow at sampling the "unclamped" probability distribution of the unit states. The Helmholtz machine and PoE model overcome this limitation by using more restricted architectures and/or approximate methods for sampling the probability distributions over units' states (see fig 1.7A). In both cases, the bottom-up weights embody a "recognition model"; that is, they are used to produce the most probable set of hidden states given the data. At the same time, the top-down weights constitute a "generative model"; that is, they produce a set of hidden states most likely to have generated the data. The "wake-sleep algorithm" maximizes the log likelihood

Figure 1.7 Hinton's Product of Experts model, showing (A) the basic architecture, with inference weights mapping data to experts and generative weights mapping experts back to data, and (B) brief Gibbs sampling, which involves several alternating iterations of clamping the input units to sample from the hidden unit states, and then clamping the hidden units to sample from the input unit states. This procedure samples the "unclamped" distribution of states in a local region around each data vector and tries to minimize the difference between the clamped and unclamped distributions.

of the data under this model and results in a simple equation for updating either set of weights:

Δw_kj = ε s_k^α (s_j^α − p_j^α),

where p_j^α is the target state for unit j on pattern α, and s_j^α is the corresponding network state, a stochastic sample based on the logistic function of the unit's net input. Target states for the generative weight updates are derived from top-down expectations based on samples using the recognition model, whereas for the recognition weights, the targets are derived by making bottom-up predictions based on samples from the generative model. The Products of Experts model advances on this learning procedure by providing a very efficient procedure called "brief Gibbs sampling" for estimating the most probable states to have generated the data, as illustrated in fig 1.7B.
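A literal numeric sketch of this delta rule follows; all states, probabilities, and the learning rate are invented for illustration, with s denoting sampled binary states and p the corresponding logistic probabilities, as in the equation:

```python
def delta_w(eps, s_pre, s_post, p_post):
    # Delta w_kj = eps * s_k * (s_j - p_j) for every presynaptic k, postsynaptic j
    return [[eps * s_k * (s_j - p_j) for s_j, p_j in zip(s_post, p_post)]
            for s_k in s_pre]

# invented states for one pattern: binary samples s and logistic probabilities p
s_pre = [1.0, 0.0]           # presynaptic sampled states
s_post = [1.0, 0.0, 1.0]     # postsynaptic sampled states
p_post = [0.8, 0.3, 0.9]     # postsynaptic probabilities (logistic of net input)
dW = delta_w(0.5, s_pre, s_post, p_post)
# only weights from the active presynaptic unit change, and each change is
# proportional to the mismatch between the sampled state and its probability
```

Repeated updates of this form shrink the mismatch between the sampled states and the probabilities, on average, whenever the presynaptic unit is active.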

1.3 Models of Episodic Memory

Moving beyond sensory coding to high-level memory systems in the medial temporal lobe (MTL), the brain may use very different optimization principles aimed at the memorization of complex events or spatiotemporal episodes, and at subsequent reconstruction of details of these episodic memories. Here, rather than recoding the incoming signals in a way that abstracts away unnecessary details, the goal is to memorize the incoming signal as accurately as possible in a single learning trial. The hippocampus is a key structure in the MTL that appears to be crucial for episodic memory. It receives input from most cortical regions, and is at the point of convergence between the ventral and dorsal visual pathways, as illustrated in fig 1.8 (adapted from Mishkin et al. (1997)). Some of the unique anatomical and physiological characteristics of the hippocampus include the following: (1) the very large expansion of dimensionality from the entorhinal cortex (EC) to the dentate gyrus (DG) (the principal cells of the dentate gyrus outnumber those of the EC by about a factor of 5 in the rat (Amaral et al., 1990)); (2) the large and potent mossy fiber synapses projecting from the dentate gyrus to CA3, which are the largest synapses in the brain and have been referred to as "detonator synapses" (McNaughton and Morris, 1987); and (3) the extensive set of recurrent collateral connections within the CA3 region. In addition, the hippocampus exhibits unique physiological properties, including (1) extremely sparse activations (low levels of activity), particularly in the dentate gyrus, where firing rates of granule cells are about 0.5 Hz (Barnes et al., 1990; Jung and McNaughton, 1993), and (2) the constant replacement of neurons (neurogenesis) in the dentate gyrus: about 1% of the neurons in the dentate gyrus are replaced each day in young adult rats (Martin Wojtowicz, University of Toronto, unpublished data).
In 1971 Marr put forward a highly inﬂuential theory of hippocampal coding (Marr, 1971). Central to Marr’s theory were the notions of a rapid, temporary memory store mediated by sparse activations and Hebbian learning, an associative retrieval system mediated by recurrent connections, and a gradual consolidation process by which new memories would be transferred into a long-term neocortical store. In the decades since the publication of Marr’s computational theory, many researchers have built on these ideas and simulated memory formation and retrieval in Marr-like models of the hippocampus. For the most part, modelers have focused on either the CA3 or CA1 ﬁelds, using variants of Hebbian learning, for example, competitive learning in the dentate gyrus and CA3 (Hasselmo et al., 1996; McClelland et al., 1995; Rolls, 1989), Hebbian autoassociative learning (Kali and Dayan, 2000; Marr, 1971; McNaughton and Morris, 1987; O’Reilly and Rudy, 2001; Rolls, 1989; Treves and Rolls, 1992), temporal associative learning (Gerstner and Abbott, 1997; Levy, 1996; Stringer et al., 2002; Wallenstein and Hasselmo, 1997) in the CA3 recurrent collaterals, and Hebbian heteroassociative learning between EC-driven CA1 activity and CA3 input (Hasselmo and Schnell, 1994) or between EC-driven and CA3-driven CA1 activity at successive points in time (Levy et al., 1990). The key ideas behind these models are summarized in ﬁg 1.9.

Figure 1.8 Some of the main anatomical connections of the hippocampus. The hippocampus is a major convergence zone. It receives input via the entorhinal cortex from most regions of the brain, including ventral stream inputs (via perirhinal cortex) and dorsal stream inputs (via parahippocampal cortex). It also sends reciprocal projections back to most regions of the brain. Within the hippocampus, the major regions are the dentate gyrus (DG), CA3, and CA1. The CA1 region projects back to the entorhinal cortex, thus completing the loop. Note that the subiculum, not shown here, is another major output target of the hippocampus.

In modeling the MTL's hippocampal memory system, Becker (2005) has shown that a global optimization principle based on the goal of accurate input reconstruction, combined with neuroanatomical constraints, leads to simple, biologically plausible learning rules for all regions within the hippocampal circuit. The model exhibits the key features of an episodic memory system: high storage capacity, accurate cued recall, and association of items across time, under extremely high plasticity conditions. The key assumptions in Becker's model are as follows: (1) during encoding, dentate granule cells are active, whereas during retrieval they are relatively silent; (2) during encoding, activation of CA3 pyramidal cells is dominated by the very strong mossy fiber inputs from dentate granule cells, whereas during retrieval it is driven by direct perforant path inputs from the entorhinal cortex combined with time-delayed input from CA3 via the recurrent collaterals; and (3) during encoding, activation of CA1 pyramidal cells is dominated by direct perforant path inputs from the entorhinal cortex.


Figure 1.9 Various models have been proposed for specific regions of the hippocampus. For example, (A) models based on variants of competitive learning have been proposed for the dentate gyrus; (B) many models of the CA3 region have been based upon the recurrent autoassociator; and (C) several models of CA1 have been based on the heteroassociative network, where the input from the entorhinal cortex to CA1 acts as a teaching signal, to be associated with the (nondriving) input from the CA3 region.

During retrieval, CA1 activations are driven by a combination of perforant path inputs from the entorhinal cortex and Schaffer collateral inputs from CA3. Becker proposed that each hippocampal layer should form a neural representation that could be transformed in a simple manner—i.e., linearly—to reconstruct the original activation pattern in the entorhinal cortex. With the addition of biologically plausible processing constraints regarding connectivity, sparse activations, and two modes of neuronal dynamics during encoding versus retrieval, this results in very simple Hebbian learning rules. It is important to note, however, that the model itself is highly nonlinear, due to the sparse coding in each region and the multiple stages of processing in the circuit as a whole; the notion of linearity only comes in at the point of reconstructing the EC activation pattern from any one region's activities. The objective function made use of the idea of an implicit set of reconstruction weights from each hippocampal region, by assuming that the perforant path connection weights could be used in reverse to reconstruct the EC input pattern. Taking the CA3 layer as an example, the CA3 neurons receive perforant path input from the entorhinal cortex, EC^(in), associated with a matrix of weights W^(EC,CA3). The CA3 region also receives input connections from the dentate gyrus, DG, with associated weights W^(DG,CA3), as well as recurrent collateral input from within the CA3 region with connection weights W^(CA3,CA3). Using the transpose of the perforant path weights, (W^(EC,CA3))^T, to calculate the CA3 region's reconstruction of the entorhinal input vector,

EC^(reconstructed) = (W^(EC,CA3))^T CA3,   (1.1)

the goal of the learning is to make this reconstruction as accurate as possible. To quantify this goal, the objective function Becker proposed to maximize is the cosine of the angle between the original and reconstructed activations:

Perf(CA3) = cos(EC^(in), (W^(EC,CA3))^T CA3)
          = [(EC^(in))^T (W^(EC,CA3))^T CA3] / [||EC^(in)|| ||(W^(EC,CA3))^T CA3||].   (1.2)

By rearranging the numerator, and appropriately constraining the activation levels and the weights so that the denominator becomes a constant, it is equivalent to maximize the following simpler expression:

Perf(CA3) = (W^(EC,CA3) EC^(in))^T CA3,   (1.3)

which makes use of the locally available information arriving at the CA3 neurons' incoming synapses: the incoming weights and activations. This says that the incoming weighted input from the perforant path should be as similar as possible to the activation in the CA3 layer. Note that the CA3 activation, in turn, is a function of both perforant path and DG input as well as CA3 recurrent input. The objective functions for the dentate and CA1 regions have exactly the same form as equation 1.3, using the DG and CA1 activations and perforant path connection weights, respectively. Thus, the computational goal for the learning in each region is to maximize the overlap between the perforant path input and that region's reconstruction of the input. This objective function can be maximized with respect to the connection weights on each set of input connections for a given layer, to derive a set of learning equations. By combining the learning principle with the above constraints, Hebbian learning rules are derived for the direct (monosynaptic) pathways from the entorhinal cortex to each hippocampal region, a temporal Hebbian associative learning rule is derived for the CA3 recurrent collateral connections, and a form of heteroassociative learning is derived for the Schaffer collaterals (the projection from CA3 to CA1). Of fundamental importance for computational theories of hippocampal coding is the striking finding of neurogenesis in the adult hippocampus. Although there is now a large literature on neurogenesis in the dentate gyrus, and it has been shown to be important for at least one form of hippocampal-dependent learning, surprisingly few attempts have been made to reconcile this phenomenon with theories of hippocampal memory formation. Becker (2005) suggested that the function of new neurons in the dentate gyrus is in the generation of novel codes.
Gradual changes in the internal code of the dentate layer were predicted to facilitate the formation of distinct representations for highly similar memory episodes.
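To make the reconstruction objective of equations 1.1 and 1.2 concrete, here is a minimal numeric sketch in plain Python; the weight matrix and activations are invented, and the CA3 activation is simplified to perforant-path drive only (omitting the DG and recurrent inputs that the full model includes):

```python
def matvec(W, x):
    # (W x)_j = sum_i W[j][i] * x[i]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den

# invented tiny network: 2 EC units, 3 CA3 units; W maps EC to CA3
W = [[0.9, 0.1],
     [0.2, 0.8],
     [0.1, 0.1]]
ec_in = [1.0, 0.2]

ca3 = matvec(W, ec_in)              # simplified CA3 activation (perforant path only)
ec_rec = matvec(transpose(W), ca3)  # eq. 1.1: run the same weights in reverse
perf = cosine(ec_in, ec_rec)        # eq. 1.2: cosine of original vs. reconstruction
```

Because the same weights serve both the forward drive and the implicit reconstruction, maximizing this cosine needs only information locally available at the synapses, which is what licenses the Hebbian rules described above.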


Figure 1.10 Architecture of a learning system that incorporates perceptual learning, episodic memory, and motor control. The components shown are posterior cortical areas (stimulus coding), the medial temporal lobe and hippocampus (memorization of complex events), the prefrontal cortex (prediction and control), motor output, and environmental feedback.

Why doesn’t the constant turnover of neurons in the dentate gyrus, and hence the constant rewiring of the hippocampal memory circuit, interfere with the retrieval of old memories? The answer to this question comes naturally from the above assumptions about neuronal dynamics during encoding versus retrieval. New neurons are added only to the dentate gyrus, and the dentate gyrus drives activation in the hippocampal circuit only during encoding, not during retrieval. Thus, the new neurons contribute to the formation of distinctive codes for novel events, but not to the associative retrieval of older memories.

1.4 Representations That Guide Action Selection

Moving beyond the question of how information is represented, we must consider the brain not simply as a passive storage device, but as part of a dynamical computational system that acts and reacts to changes within its environment, as illustrated in fig 1.10. Ultimately, models based on the broad goals of prediction and control may be our best hope for characterizing the complex dynamical systems which form representations in the service of guiding motor actions.

TD-learning

Reinforcement learning algorithms can be applied to control problems, and have been linked closely to specific neural mechanisms. These algorithms are built upon the concept of a value function, V(s_t), which defines the value of being in the current state s_t at time t to be equal to the expected sum of future rewards:

V(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + · · · + γⁿ r_{t+n} + · · ·

The parameter γ, chosen to be in the range 0 ≤ γ ≤ 1, is a temporal discount factor which permits one to heuristically weight future rewards more or less heavily according to the task demands. Within this framework, the goal for the agent is to choose actions that will maximize the value function. In order for the agent to solve the control problem (how to select optimal actions), it must first solve the prediction problem (how to estimate the value function). The temporal difference (TD) learning algorithm (Sutton, 1988; Sutton and Barto, 1981) provides a rule for incrementally updating an estimate V̂_t of the true value function at time t by an amount called the TD-error:

TD-error = r_{t+1} + γ V̂_{t+1} − V̂_t,

which makes use of the reward received on the transition and the value estimates at the current and the next time step. It has been proposed that the TD-learning algorithm may be used by neurobiological systems, based on evidence that the firing of midbrain dopamine neurons correlates well with the TD-error (Montague et al., 1996).

Q-learning

The Q-learning algorithm (Watkins, 1989) extends the idea of TD learning to the problem of learning an optimal control policy for action selection. The goal for the agent is to maximize the total future expected reward. The agent learns incrementally by trial and error, evaluating the consequences of taking each action in each situation. Rather than using a value function, Q-learning employs an action-value function, Q(s_t, a_t), which represents the value of taking an action a_t when the state of the environment is s_t. The learning algorithm for incrementally updating estimates of Q-values is directly analogous to TD learning, except that the TD-error is replaced by a temporal difference between Q-values at successive points in time.

Becker and Lim (2003) proposed a model of controlled memory retrieval based upon Q-learning. People have a remarkable ability to encode and retrieve information in a flexible manner. Understanding the neuronal mechanisms underlying strategic memory use remains a true challenge.
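The TD and Q-learning updates just described can be sketched in tabular form; the one-step task, reward sizes, and parameters below are hypothetical (the reward magnitudes anticipate the 4- versus 2-pellet T-maze choice discussed later):

```python
import random

def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    # TD-error = r + gamma * V[s_next] - V[s]; nudge the estimate by alpha * TD-error
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

def q_update(Q, s, a, r, next_qs, alpha=0.1, gamma=0.9):
    # Q-learning: the TD-error becomes a temporal difference between action-values
    Q[(s, a)] += alpha * (r + gamma * max(next_qs) - Q[(s, a)])

# prediction: a single transition with reward 1; both estimates start at zero
V = {"s0": 0.0, "terminal": 0.0}
first_error = td_update(V, "s0", "terminal", r=1.0)

# control: from 'start', action 'left' yields 4 pellets, 'right' yields 2 (terminal)
random.seed(0)
Q = {("start", "left"): 0.0, ("start", "right"): 0.0}
for _ in range(2000):
    a = random.choice(["left", "right"])       # pure exploration, for simplicity
    r = 4.0 if a == "left" else 2.0
    q_update(Q, "start", a, r, next_qs=[0.0])  # terminal next state: no future value
best = max(("left", "right"), key=lambda a: Q[("start", a)])
```

Note how the Q-update has exactly the TD-error form, with the next state's value estimate replaced by the best available action-value at that state.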
Neural network models of memory have typically dealt with only the most basic operations involved in storage and recall. Evidence from patients with frontal lobe damage indicates a crucial role for the prefrontal cortex in the control of memory. Becker and Lim’s model was developed to shed light on the neural mechanisms underlying strategic memory use in individuals with intact and lesioned frontal lobes. The model was trained to simulate human performance on free-recall tasks involving lists of words drawn from a small set of categories. Normally when people are asked repeatedly to study and recall the same list of words, their recall patterns demonstrate progressively more categorical clustering over trials. This strategy thus appears to be learned, and correlates with overall recall scores. On the other hand, when patients with frontal lobe damage perform such tests, while they do beneﬁt somewhat from the categorical structure of word lists, they tend to recall fewer categories in total, and tend to show lower semantic clustering scores. Becker and Lim (2003) postulated a role for the prefrontal cortex (PFC) in self-organizing novel mnemonic codes that could subsequently be used as retrieval cues to improve retrieval from long-term memory. Their model is outlined in ﬁg 1.11. The “actions” or responses in this model are actually the activations generated by model neurons in the PFC module. Thus, the activation of each response unit is proportional to the network’s current estimate of the Q-value associated with

Figure 1.11 Becker and Lim's architecture for modeling the frontal control of memory retrieval. Each panel shows a lexical/semantic input/output module (lexical units and semantic feature units), other cortical modules, an MTL memory module, context units, and a PFC module containing WM units and response units, together with an internal reinforcement signal. The model operated in two different modes: (A) During perception of an externally presented item (during a study phase) there was bottom-up flow of activation. (B) During free recall, when a response was generated internally, there was a top-down flow of activation in the model. After an item was retrieved, but before a response was generated, the item was used to probe the MTL memory system, and its recency was evaluated. If the recency (based on a match of the item to the memory weight matrix) was too high, the item was considered to be a repetition error, and if too low, it was considered to be an extralist intrusion error. Errors detected by the model were not generated as responses, but were used to generate internal reinforcement signals for learning the PFC module weights. Occasionally, a repetition or intrusion error might go undetected by the model, resulting in a recall error.

that response, and response probabilities are calculated directly from these Q-values. Learning the memory retrieval strategy involved adapting the weights for the response units so as to maximize their associated Q-values. Reinforcement obtained on a given trial was self-generated by an internal evaluation module, so that the PFC module received a reward whenever a nonrepeated study list item was retrieved, and a punishment signal (negative reinforcement) when a nonlist or repeated item was retrieved. The model thereby learned to develop retrieval strategies dynamically in the course of both study and free recall of words. The model was able to capture the performance of human subjects with both intact and lesioned frontal lobes on a variety of types of word lists, in terms of both recall accuracy and patterns of errors. The model just described addresses a rather high level of complex action selection, namely, the selection of memory retrieval strategies. Most work on modeling action selection has dealt with more concrete and observable actions, such as the choice of lever-presses in a response box or the choice of body-turn directions in a maze. The advantage of this level of modeling is that it can make contact with a large body of experimental literature on animal behavior, pharmacology, and physiology. Many such models have employed TD-learning or Q-learning, under the assumption that animals form internal representations of value functions, which

Figure 1.12 Simulation of Cousins et al.'s T-maze cost-benefit task. The MDP representation of the task is shown in panel A (states linked by the actions Left, Right, and Do nothing, with food rewards of 4 pellets in one terminal state and 2 in the other); the T-maze with a barrier and larger reward in the left arm of the maze is shown in panel B; and the performance of the model, plotted as expected reward (pellets) for the left and right choices against the DA level, is shown in panel C.

guide action selection. As mentioned above, phasic ﬁring of dopamine neurons has been postulated to convey the TD-error signal critical for this type of learning. However, in addition to its importance in modulating learning, dopamine plays an important role in modulating action choice. It has been hypothesized that tonic levels of dopamine have more to do with motivational value, whereas the phasic ﬁring of dopamine neurons conveys a learning-related signal (Smith et al., 2005). Rather than assuming that actions are solely guided by value functions, Smith et al. (2005) hypothesized that animals form detailed internal models of the world. Value functions condense the reward value of a series of actions into that of a single state, and are therefore insensitive to the motivational state of the animal (e.g., whether it is hungry or not). Internal models, on the other hand, allow a mental simulation of alternative action choices, which may result in qualitatively diﬀerent rewards. For example, an animal might perform one set of actions leading to water only if it is thirsty, and another set of actions leading to food only if it is hungry. The internal model can be described by a Markov decision process (MDP) over a set of internal states, with associated transition function and reward function, as in ﬁg 1.12A). The transition function and (immediate) reward value of each


state are learned through trial and error. Once the model is fully trained, action selection involves simulating a look-ahead process in the internal model for one or more steps in order to evaluate the consequences of an action. Finally, at the end of the simulation sequence, the animal’s internal model reveals whether the outcome is favorable (leads to reward) or not. An illustrative example is shown in ﬁg 1.12B. The choice faced by the animal is either to take the right arm of the T-maze to receive a small reward, or to take the left arm and then jump over a barrier to receive a larger reward. The role of tonic dopamine in this model is to modulate the eﬃcacy of the connections in the internal model. Thus, when dopamine is depleted, the model’s ability to simulate the look-ahead process to assess expected future reward will be biased toward rewards available immediately rather than more distal rewards. This implements an online version of temporal discounting. Cousins et al. (1996) found that normal rats trained in the T-maze task in ﬁg 1.12B) are willing to jump the barrier to receive a larger food reward nearly 100% of the time. Interestingly, however, when rats were administered a substance that destroys dopaminergic (DA) projections to the nucleus accumbens (DA lesion), they chose the smaller reward. In another version of the task, rats were trained on the same maze except that there was no food in the right arm, and then when given DA lesions, they nearly always chose the left arm and jumped the barrier to receive a reward. Thus, the DA lesion was not merely disrupting motor behavior, it was interacting with the motivational value of the behavioral choices. Note that the TD-error account of dopamine only provides for a role in learning, and would have nothing to say about eﬀects of dopamine on behavior subsequent to learning. Smith et al. 
(2005) argued, based on these and other data, that dopamine serves to modulate the motivational choice of the animals, with high levels of dopamine favoring the selection of action sequences with more distal but larger rewards. In simulations of the model, depletion of dopamine therefore biases the choice in favor of the right arm in this task, as shown in ﬁg 1.12C).
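A toy version of this dopamine-modulated look-ahead can be sketched by treating the tonic dopamine level as a per-step attenuation on simulated future reward (the pellet counts follow the task above; the attenuation scheme is a simplification for illustration, not Smith et al.'s actual model):

```python
def simulated_value(rewards_per_step, da_level):
    # look ahead through the internal model; each further step's predicted reward
    # is attenuated by the tonic dopamine level, an online form of temporal discounting
    return sum(da_level ** t * r for t, r in enumerate(rewards_per_step))

def choose_arm(da_level):
    left = simulated_value([0.0, 4.0], da_level)  # larger reward, one step further (barrier)
    right = simulated_value([2.0], da_level)      # smaller but immediate reward
    return "left" if left > right else "right"

intact = choose_arm(1.0)     # intact dopamine: jump the barrier for the larger reward
depleted = choose_arm(0.25)  # depleted dopamine: the distal reward is attenuated away
```

Even this caricature reproduces the qualitative pattern in fig 1.12C: the preference flips from the distal large reward to the immediate small one as the dopamine level falls.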

1.5 New Directions: Integrating Multiple Memory Systems

In this chapter, we have reviewed several approaches to modeling the mind, from low-level sensory coding, to high-level memory systems, to action selection. Somehow, the brain accomplishes all of these functions, and it is highly unlikely that they are carried out in isolation from one another. For example, we now know that striatal dopaminergic pathways, presumed to carry a reinforcement learning signal, affect sensory coding even in early sensory areas such as primary auditory cortex (Bao et al., 2001). Future work must address the integration of these various levels of modeling.

2

Empirical Statistics and Stochastic Models for Visual Signals

David Mumford

Bayes’s rule

The formulation of the vision problem as a problem in Bayesian inference (Forsyth and Ponce, 2002; Mumford, 1996, 2002) is, by now, well known and widely accepted in the computer vision community. In fact, the insight that the problem of reconstructing 3D information from a 2D image is ill posed and needs inference can be traced back to the Arab scientist Ibn al-Haytham (known to Europe as Alhazen) around the year 1000 (Haytham, c. 1000). Inheriting a complete hodgepodge of conflicting theories from the Greeks,1 al-Haytham for the first time demonstrated that light rays originated only in external physical sources, and moved in straight lines, reflecting and refracting, until they hit the eye; and that the resulting signal needed to be, and was, actively decoded in the brain using a largely unconscious and very rapid inference process based on past visual experiences. In the modern era, the inferences underlying visual perception have been studied by many people, notably H. Helmholtz, E. Brunswik (Brunswik, 1956), and J. J. Gibson. In mathematical terms, the Bayesian formulation is as follows: let I be the observed image, a 2D array of pixels (black-and-white or colored, or possibly a stereoscopic pair of such images). Here we are assuming a static image.2 Let w stand for variables that describe the external scene generating the image. Such variables should include depth and surface orientation information (Marr's 2.5D sketch), location and boundaries of the principal objects in view, their surface albedos, location of light sources, and labeling of object categories and possibly object identities. Then two stochastic models, learned from past experience, are required: a prior model p(w) specifying what scenes are likely in the world we live in, and an imaging model p(I|w) specifying what images should look like, given the scene. Then, by Bayes's rule,

p(w|I) = p(I|w) p(w) / p(I) ∝ p(I|w) p(w).

Bayesian inference consists in fixing the observed value of I and inferring that w equals the value which maximizes p(w|I), or equivalently maximizes p(I|w)p(w). This is a fine general framework, but to implement or even test it requires (1) a theory of stochastic models of a very comprehensive sort which can express all the complex but variable patterns which the variables w and I obey, (2) a method of learning from experience the many parameters which such theories always contain, and (3) a method of computing the maximum of p(w|I). This chapter will be concerned only with problem 1. Many critiques of vision algorithms have failed to allow for the fact that these are three separate problems: if the methods for problems 2 or 3 are badly implemented, the resulting failures do not imply that the theory itself (problem 1) is bad. For example, very slow algorithms of type 3 may reasonably be used to test ideas of type 1. Progress in understanding vision does not require all these problems to be solved at once. Therefore, it seems to me legitimate to isolate problems of type 1. In the rest of this chapter, I will review some of the progress in constructing these models. Specifically, I will consider, in section 2.1, models of the empirical probability distribution p(I) inferred from large databases of natural images. Then, in section 2.2, I will consider the problem of the first step in so-called intermediate vision: inferring the regions which should be grouped together as single objects or structures, a problem which includes segmentation and gestalt grouping, the basic grammar of image analysis. Finally, in section 2.3, I look at the problem of priors on 2D shapes and the related problem of what it means for two shapes to be "similar". Obviously, all of these are huge topics and I cannot hope to give a comprehensive view of work on any of them. Instead, I shall give my own views of some of the important issues and open problems and outline the work that I know well. As this inevitably emphasizes the work of my associates, I must beg indulgence from those whose work I have omitted.
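As a minimal discrete illustration of the Bayesian formulation above (the two scene hypotheses and all probability values are invented):

```python
# two hypothetical scene descriptions w, with the observed image I summarized by
# one likelihood value per scene (all numbers invented for illustration)
prior = {"scene_A": 0.7, "scene_B": 0.3}       # p(w)
likelihood = {"scene_A": 0.2, "scene_B": 0.9}  # p(I | w) for the observed I

# Bayes's rule up to the constant p(I): p(w | I) is proportional to p(I | w) p(w)
unnormalized = {w: likelihood[w] * prior[w] for w in prior}
p_I = sum(unnormalized.values())               # the normalizer p(I)
posterior = {w: v / p_I for w, v in unnormalized.items()}

# MAP inference: fix I and pick the w maximizing p(w | I)
w_map = max(posterior, key=posterior.get)
```

Note that the MAP choice can overturn the prior: scene_B is a priori less likely, but its much higher likelihood under the observed image wins.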

2.1 Statistics of the Image Alone

The most direct approach to studying images is to ask whether we can find good models for images without any hidden variables. This means first creating a large database of images I that we believe are reasonably random samples of all possible images of the world we live in. Then we can study this database with all the tools of statistics, computing the responses of various linear and nonlinear filters and looking at the individual and joint histograms of their values. "Nonlinear" should be taken in the broadest sense, including order statistics or topological analyses. We then seek to isolate the most important properties these statistics have and to create the simplest stochastic models p(I) that duplicate or approximate these statistics. The models can be further tested by sampling from them and seeing if the resulting artificial images have the same "look and feel" as natural images; or if not, what are the simplest properties of natural images that we have failed to capture. For another recent survey of such models, see Lee et al. (2003b).

2.1.1 High Kurtosis as the Universal Clue to Discrete Structure

kurtosis

The first really striking thing about filter responses is that they always have large kurtosis. It is strange that the electrical engineers designing TV sets in the 1950s do not seem to have pointed this out; the fact first appeared in the work of David Field (Field, 1987). By kurtosis, we mean the normalized fourth moment. If x is a random real number, its kurtosis is

κ(x) = E[(x − x̄)^4] / E[(x − x̄)^2]^2.

stationary Markov process

Every normal variable has kurtosis 3; a variable which has no tails (e.g., one uniformly distributed on an interval) or is bimodal and small at its mean tends to have kurtosis less than 3; a variable with heavy tails or a large peak at its mean tends to have kurtosis larger than 3. The empirical result observed for images is that for any linear filter F with zero mean, the values x = (F ∗ I)(i, j) of the filtered image follow a distribution with kurtosis larger than 3. The simplest case of this is the difference of adjacent pixel values, the discrete derivative of the image I. But it has been found (Huang, 2000) to hold even for random mean-zero filters supported in an 8 × 8 window. This high kurtosis is shown in fig 2.1, from the thesis of J. Huang (Huang, 2000). These data were extracted from a large database of high-resolution, fully calibrated images of cities and countryside taken in Holland by van Hateren (1998). It is important, when studying tails of distributions, to plot the logarithm of the probability or frequency, as in this figure, not the raw probability. If you plot probabilities, all tails look alike. But if you plot their logarithms, then a normal distribution becomes a downward-facing parabola (since log(e^{−x²}) = −x²), so heavy tails appear clearly as curves which do not point down so fast. It is a well-known fact from probability theory that if X_t is a stationary Markov stochastic process, then the kurtosis of X_t − X_s being greater than 3 means that the process X_t has discrete jumps. In the case of vision, we have samples from an image I(s, t) depending on two variables rather than one, and the zero-mean filter is a generalization of the difference X_t − X_s. Other signals generated by the world, such as sound or prices, are functions of one variable, time.
A nice elementary statement of the link between kurtosis and jumps is given by the following result, taken from Mumford and Desolneux:

Theorem 2.1  Let x be any real random variable which we normalize to have mean 0 and standard deviation 1. Then there is a constant c > 0 depending only on x such that if, for some n, x is the sum

x = y_1 + y_2 + · · · + y_n,

where the y_i are independent and identically distributed, then

Prob( max_i |y_i| ≥ √((κ(x) − 3)/2) ) ≥ c.
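Why must the summands be jumpy? The fourth cumulant is additive over independent summands, which gives a one-line computation (our aside, not part of the theorem's proof):

```latex
% Fourth cumulant: \kappa_4(z) = E[(z-\bar z)^4] - 3\,\mathrm{Var}(z)^2,
% additive over independent summands. With x = y_1 + \cdots + y_n,
% the y_i i.i.d. and Var(x) = 1, so Var(y_1) = 1/n:
\[
  \kappa(x) - 3 \;=\; \kappa_4(x) \;=\; n\,\kappa_4(y_1)
  \;=\; n\,\bigl(\kappa(y_1) - 3\bigr)\,\mathrm{Var}(y_1)^2
  \;=\; \frac{\kappa(y_1) - 3}{n}.
\]
```

Thus κ(y_1) = 3 + n(κ(x) − 3): the finer the decomposition, the heavier-tailed the increments must be, so a stationary process with excess kurtosis in its increments cannot be built from Gaussian-like pieces.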


Empirical Statistics and Stochastic Models for Visual Signals


Figure 2.1  Histograms of filter values from the thesis of J. Huang, using the van Hateren database. On the left, the filter is the difference of (a) horizontally adjacent pixels, and of adjacent (b) 2 × 2, (c) 4 × 4, and (d) 8 × 8 blocks; on the right, several random mean-zero filters with 8 × 8 pixel support have been used. The kurtosis of all these filter responses is between 7 and 15. Note that the vertical axis is the log of the frequency, not frequency. The histograms on the left are displaced vertically for legibility and the dotted lines indicate one standard deviation.


A striking application of this is to the stock market. Let x be the log price change between the opening and closing price of some stock. If we assume price changes are Markov, as many have, and use the experimental fact that price changes have kurtosis greater than 3, then it follows that stock prices cannot be modeled as a continuous function of time. In fact, in my own fit of some stock-market data, I found the kurtosis of log price changes to be infinite: the tails of the histogram of log price changes appeared to be polynomial, like 1/x^α with α between 4 and 5. An important question is: how big are the tails of the histograms of image filter statistics? Two models have been proposed for these distributions. The first is the most commonly used model, the generalized Laplacian distribution:

p_laplace(x) = (1/Z) e^{−|x/a|^b},   Z = ∫ e^{−|y/a|^b} dy.

Here a is a scale parameter and b controls how large the tails are (larger tails for smaller b). Experimentally, these work well and values of b between 0.5 and 1 are commonly found. However, no rationale for their occurrence seems to have been found. The second is the Bessel distribution (Grenander and Srivastava, 2001; Wainwright and Simoncelli, 2000):

p̂_bessel(ξ) = q(ξ) = 1/(1 + aξ^2)^{b/2}.

Again, a is a scale parameter, b controls the kurtosis (as before, larger kurtosis for smaller b), and the hat means Fourier transform. p_bessel(x) can be evaluated


explicitly using Bessel functions. The tails, however, are all asymptotically like those of double exponentials e^{−|x/a|}, regardless of b. The key point is that these distributions arise as the distributions of products r·x of Gaussian random variables x and an independent positive "scaling" random variable r. For some values of b, the variable r is distributed like ‖x‖ for a Gaussian x ∈ R^n, but in general its square has a gamma (or chi-squared) distribution. The great appeal of such a product is that images are also formed as products, especially as products of local illumination, albedo, and reflectance factors. This may well be the deep reason for the validity of the Bessel models. Convincing tests of which model is better have not been made. The difficulty is that they differ most in their tails, where data is necessarily very noisy. The best approach might be to use the Kolmogorov-Smirnov statistic and compare the best-fitting models for this statistic of each type. The world seems to be composed of discrete jumps in time and discrete objects in space. This profound fact about the physical nature of our world is clearly mirrored in the simple statistic kurtosis.

2.1.2  Scaling Properties of Images and Their Implications

After high kurtosis, the next most striking statistical property of images is their approximate scale invariance. The simplest way to define scale invariance precisely is this: imagine we had a database of 64 × 64 images of the world and that this could be modeled by a probability distribution p_64(I) in the Euclidean space R^4096 of all such images. Then we can form marginal 32 × 32 images in two different ways: we either extract the central 32 × 32 set of pixels from the big image I or we cover the whole 64 × 64 image by 1,024 2 × 2 blocks of pixels and average each such block to get a 32 × 32 image (i.e., we "blow down" I in the crudest way). The assertion that images are samples from a scale-invariant distribution is that the two resulting marginal distributions on 32 × 32 images are the same. This should happen for images of any size and we should also assume that the distribution is stationary, i.e., translating an image gives an equally probable image. The property is illustrated in fig 2.2. It is quite remarkable that, to my knowledge, no test of this hypothesis on reasonably large databases has contradicted it. Many histograms of filter responses on successively blown-down images have been made; order statistics have been looked at; and some topological properties derived from level curves have been studied (Geman and Koloydenko, 1999; Gousseau, 2000; Huang, 2000; Huang and Mumford, 1999). All have shown approximate scale invariance. There seem to be two simple facts about the world which combine to make this scale invariance approximately true. The first is that images of the world are taken from random distances: you may photograph your spouse's face from one inch away or from 100 meters away or anything in between. On your retina, except for perspective distortions, his or her image is scaled up or down as you move closer or farther away. The second is that objects tend to have surfaces on which smaller objects



Figure 2.2  Scale invariance defined as a fixed point under block renormalization. The top is a random 2N × 2N image which produces the two N × N images on the bottom, one by extracting a subimage, the other by 2 × 2 block averaging. These two should have the same marginal distributions. (Figure from A. Lee.)


cluster: your body has limbs which have digits which have hairs on them; your office has furniture which has books and papers which have writing on them (a limiting case of very flat objects on a surface), etc. Thus a blowup of a photograph not only shows roughly the same number of salient objects, but they occur with roughly the same contrast. The simplest consequence of scale invariance is the law for the decay of power at high frequencies in the Fourier transform of images (or better, the discrete cosine transform, to minimize edge effects). It says that the expected power as a function of frequency should drop off like

E_I |Î(ξ, η)|^2 ≈ C/(ξ^2 + η^2) = C/f^2,

where f = √(ξ^2 + η^2) is the spatial frequency. This power law was discovered in the 1950s. In the image domain, it is equivalent to saying that the autocorrelation of the image is approximated by a constant minus the log of the distance:

E_{I,x,y} [(I(x, y) − Ī)·(I(x + a, y + b) − Ī)] ≈ C − log(√(a^2 + b^2)).
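Both the 1/f^2 law and the block-renormalization test are easy to check on a synthetic field (our sketch, assuming numpy; the field is Gaussian white noise shaped to a 1/f^2 power spectrum, i.e., the scale-invariant Gaussian model mentioned later in this section):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 128

# Shape complex white noise so the expected power spectrum is ~ 1/f^2.
fx = np.fft.fftfreq(N)[:, None]
fy = np.fft.fftfreq(N)[None, :]
f2 = fx**2 + fy**2
f2[0, 0] = np.inf                         # kill the (divergent) DC term
amp = 1.0 / np.sqrt(f2)                   # amplitude spectrum ~ 1/f
spectrum = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
img = np.fft.ifft2(spectrum * amp).real

# Two ways of producing a 64 x 64 "marginal" image from the 128 x 128 one:
crop = img[N//4:3*N//4, N//4:3*N//4]                          # central subimage
blow_down = img.reshape(N//2, 2, N//2, 2).mean(axis=(1, 3))   # 2x2 averages

# Compare the std of the horizontal pixel difference (a simple
# second-order statistic); for a scale-invariant field they should agree.
s1 = np.diff(crop, axis=1).std()
s2 = np.diff(blow_down, axis=1).std()
print(s1, s2)    # roughly equal; for white noise s2/s1 would be about 0.5
```

This only probes second-order statistics, which a Gaussian field matches by construction; the point of the text is that real images additionally have the high-kurtosis marginals no Gaussian model can reproduce.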

Note that the models have both infrared and ultraviolet divergences: the total


power diverges for both f → 0 and ∞, and the autocorrelation goes to ±∞ as a, b → 0 and ∞. Many experiments have been made testing this law over moderate ranges of frequencies and I believe the conclusion to draw is this: for small databases of images, especially databases of special sorts of scenes such as forest scenes or city scenes, different powers are found to fit best. These range from 1/f^3 to 1/f but with both a high concentration near 1/f^2 and a surprisingly large variance (Frenkel et al.; Huang, 2000). But for large databases, the rule seems to hold. Another striking consequence of the approximate scale invariance is that images, if they have infinitely high resolution, are not functions at all but must be considered generalized functions (distributions in the sense of Schwartz). This means that as their resolution increases, natural images do not have definite limiting numerical values I(x, y) at almost all points x, y in the image plane. I think of this as the "mites on your eyelashes" theorem. Biologists tell us that such mites exist and if you had Superman's X-ray vision, you not only could see them but, by the laws of reflectance, they would have high contrast, just like macroscopic objects. This mathematical implication is proven by Gidas and Mumford (Gidas and Mumford, 2001). This conclusion is quite controversial: others have proposed other function spaces as the natural home for random images. An early model for images (Mumford and Shah, 1989) proposed that observed images were naturally a sum: I(x, y) = u(x, y) + v(x, y), where u was a piecewise smooth "cartoon", representing the important content of the image, and v was some L^2 noise. This led to the idea that the natural function space for images, after the removal of noise, was the space of functions of bounded variation, i.e., those with ∫ ‖∇I‖ dx dy < ∞. However, this approach lumped texture in with noise and results in functions u from which all texture and fine detail has been removed.
More recent models, therefore, have proposed that I(x, y) = u(x, y) + v(x, y) + w(x, y), where u is the cartoon, v is the true texture, and w is the noise. The idea was put forward by DeVore and Lucier (1994) that the true image u + v belongs to a suitable Besov space, one of the spaces of functions f(x, y) for which bounds are put on the L^p norm of f(x + h, y + k) − f(x, y) for (h, k) small. More recently, Carasso has simplified their approach (Carasso, 2004) and hypothesizes that images I, after removal of "noise", should satisfy

∫∫ |I(x + h, y + k) − I(x, y)| dx dy ≤ C (h^2 + k^2)^{α/2}

for some α as (h, k) → 0. However, a decade ago, Rosenfeld argued with me that most of what people discard as "noise" is nothing but objects too small to be fully resolved by the resolution of the camera and thus blurred beyond recognition or even aliased. I think


Figure 2.3  This photo is intentionally upside-down, so you can look at it more abstractly. The left photo has a resolution of about 500 × 500 pixels and the right photo is the yellow 40 × 40 window shown on the left. Note (a) how the distinct shapes in the road made by the large wet/dry spots gradually merge into dirt texture and (b) the way the bush on the right is pure noise. If the bush had moved relative to the pixels, the pattern would be totally different. There is no clear dividing line between distinct objects, texture, and noise. Even worse, some road patches which ought to be texture are larger than salient objects like the dog.


of this as clutter. The real world is made up of objects plus their parts and surface markings of all sizes, and any camera resolves only so many of these. There is an ideal image of infinite resolution but any camera must use sensors with a positive point spread function. The theorem above says that this ideal image, because it carries all this detail, cannot even be a function. For example, it has more and more high-frequency content as the sensors are refined and its total energy diverges in the limit, hence it cannot be in L^2. In fig 2.3, we illustrate that there is no clear dividing line between objects, texture, and noise: depending on the scale at which you view and digitize the ideal image, the same "thing" may appear as an object, as part of a texture, or as just a tiny bit of noise. This continuum has been analyzed beautifully recently by Wu et al. (2006, in revision). Is there a simple stochastic model for images which incorporates both high kurtosis and scale invariance? There is a unique scale-invariant Gaussian model, namely colored white noise whose expected power spectrum conforms to the 1/f^2 law. But this has kurtosis equal to 3. The simplest model with both properties seems to be that proposed and studied by Gidas and me (Gidas and Mumford, 2001), which we call the random wavelet model. In this model, a random image is a countable sum:

I(x, y) = Σ_α ψ_α(e^{r_α} x − x_α, e^{r_α} y − y_α).


Here (r_α, x_α, y_α) is a uniform Poisson process in 3-space and the ψ_α are samples from an auxiliary Lévy process, a distribution on the space of scale- and position-normalized elementary image constituents, which one may call mother wavelets or textons. These expansions converge almost surely in all the Hilbert-Sobolev spaces H^{−ε}, ε > 0. Each component ψ_α represents an elementary constituent of the image. Typical choices for the ψ's would be Gabor patches, edgelets or curvelets, or more complex shapes such as ribbons or simple shapes with corners. We will discuss these in section 2.1.4 and we will return to the random wavelet model in section 2.2.3.
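A toy sampler for this model can be sketched in a few lines (our illustration; the Gabor-patch constituents, the truncated scale range, and the Poisson intensity are all arbitrary choices, and positions are parameterized as patch centers rather than as offsets):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 128
yy, xx = np.mgrid[0:N, 0:N] / N          # unit-square coordinates

def gabor(x, y, theta, sigma=0.05, freq=8.0):
    """A scale/position-normalized elementary constituent: an oriented Gabor patch."""
    u = np.cos(theta) * x + np.sin(theta) * y
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * u)

# Poisson process in (r, x0, y0): scale exponent r and position uniform
# over a truncated box (a cutoff is needed for a finite sample).
img = np.zeros((N, N))
for _ in range(rng.poisson(300)):
    r = rng.uniform(0.0, 2.5)            # e^r in [1, ~12]: range of patch scales
    x0, y0 = rng.uniform(0, 1, size=2)
    theta = rng.uniform(0, np.pi)
    img += gabor(np.exp(r) * (xx - x0), np.exp(r) * (yy - y0), theta)

print(img.shape)    # (128, 128): one sample from the toy random wavelet model
```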

2.1.3  Occlusion and the "Dead Leaves" Model

There is, however, a third basic aspect of image statistics which we have so far not considered: occlusion. Images are two-dimensional projections of the three-dimensional world and objects get in front of each other. This means that it is a mathematical simplification to imagine images as sums of elementary constituents. In reality, objects are ordered by distance from the lens and they should be combined by the nonlinear operation in which nearer surface patches overwrite distant ones. Statistically, this manifests itself in a strongly non-Markovian property of images: suppose an object with a certain color and texture is occluded by a nearer object. Then, on the far side of the nearer object, the more distant object may reappear, hence its color and texture have a larger probability of occurring than in a Markov model. This process of image construction was studied by the French school of Matheron and Serra based at the École des Mines (Serra, 1983 and 1988). Their "dead leaves model" is similar to the above random wavelet expansion except that occlusion is used. We imagine that the constituents of the image are tuples (r_α, x_α, y_α, d_α, D_α, ψ_α) where r_α, x_α, and y_α are as before, but now d_α is the distance from the lens to the αth image patch and ψ_α is a function only on the set of (x, y) ∈ D_α. We make no a priori condition on the density of the Poisson process from which (r_α, x_α, y_α, d_α) is sampled. The image is then given by

I(x, y) = ψ_{α(x,y)}(e^{r_{α(x,y)}} x − x_{α(x,y)}, e^{r_{α(x,y)}} y − y_{α(x,y)}),

where

α(x, y) = argmin_{α : (x,y) ∈ D_α} d_α.
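A minimal dead leaves sampler along these lines (ours; disk-shaped leaves as in fig. 2.4, uniform depths, and arbitrary scale cutoffs), where painting the leaves from farthest to nearest implements the argmin over depth:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
yy, xx = np.mgrid[0:N, 0:N]

# Each "leaf": random center, radius, depth d, and a constant gray level.
leaves = [(rng.uniform(0, N), rng.uniform(0, N),
           rng.uniform(5, 30),            # radius; scale cutoffs are arbitrary
           rng.uniform(),                 # depth d: smaller = nearer the lens
           rng.uniform())                 # gray level of this leaf
          for _ in range(rng.poisson(200))]

img = np.zeros((N, N))
# Paint in order of decreasing depth; the nearest leaf covering a pixel wins.
for x0, y0, rad, d, g in sorted(leaves, key=lambda t: -t[3]):
    img[(xx - x0)**2 + (yy - y0)**2 <= rad**2] = g

print(img.shape)   # (200, 200): an occlusion-built image as in fig. 2.4 (left)
```

The non-Markovian reappearance effect is visible in such samples: a large far disk can show on both sides of a small near one.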

This model has been analyzed by A. Lee, J. Huang, and myself (Lee et al., 2001) but has more serious infrared and ultraviolet catastrophes than the additive one. One problem is that nearby small objects cause the world to be enveloped in a sort of fog occluding everything in the distance. Another is the positive probability that one big nearby object occludes everything. In any case, with some cutoffs, Lee's models are approximately scale-invariant and seem to reproduce all the standard elementary image statistics better than any other that I know of, e.g., two-point co-occurrence statistics as well as joint wavelet statistics. Examples of both types of models are shown in fig. 2.4. I believe a deeper analysis of this category of models entails modeling directly,



Figure 2.4  Synthetic images illustrating the generic image models from the text. On the left, a sample dead leaves model using disks as primitives; on the right, a random wavelet model whose primitives are short ribbons.

not the objects in 2D projection, but their statistics in 3D. What is evident then is that objects are not scattered in 3-space following a Poisson process, but rather are agglutinative: smaller objects collect on or near the surface of bigger objects (e.g., houses and trees on the earth, limbs and clothes on people, buttons and collars on clothes, etc.). The simplest mathematical model for this would be a random branching process in which an object has "children", which are the smaller objects clustering on its surface. We will discuss a 2D version of this in section 2.2.3.

2.1.4  The Phonemes of Images

The final component of this direct attack on image statistics is the investigation of its elementary constituents, the ψ's above. In analogy with speech, one may call these constituents phonemes (or phones). The original proposals for such building blocks were given by Julesz and Marr. Julesz was interested in what made two textures distinguishable or indistinguishable. He proposed that one should break textures locally into textons (Julesz, 1981; Resnikoff, 1989) and, supported by his psychophysical studies, he proposed that the basic textons were elongated blobs and their endpoints ("terminators"). Marr (1982), motivated by the experiments of Hubel and Wiesel on the responses of cat visual cortex neurons, proposed that one should extract from an image its "primal sketch", consisting of edges, bars, and blobs. Linking these proposals with raw image statistics, Olshausen and Field (1996) showed that simple learning rules seeking a sparse coding of the image, when exposed to small patches from natural images, did indeed develop responses sensitive to edges, bars, and blobs. Another school of researchers has taken the elegant mathematical theory of wavelets and sought to find those wavelets which enable the best image compression. This has been pursued especially by Mallat (1999), Simoncelli (1999), and Donoho and their collaborators (Candes and Donoho, 2005). Having large natural image databases and powerful computers, we can now ask


for a direct extraction of these or other image constituents from a statistical analysis of the images themselves. Instead of taking psychophysical, neurophysiological, or mathematical results as a basis, what happens if we let images speak for themselves? Three groups have done this: Geman-Koloydenko (Geman and Koloydenko, 1999), Huang-Lee-Pedersen-Mumford (Lee et al., 2003a), and Malik-Shi (Malik et al., 1999). Some of the results of Huang and of Malik et al. are shown in fig. 2.5. The approach of Geman and Koloydenko was based on analyzing all 3 × 3 image patches using order statistics. The same image patches were studied by Lee and myself using their real number values. A very similar study by Pedersen and Lee (2002) replaced the nine pixel values by nine Gaussian derivative filter responses. In all three cases, a large proportion of such image patches were found to be either low contrast, or high contrast and cut across by a single edge. This, of course, is not a surprise, but it quantifies the significance of edges in image structure. For example, in the study by Lee, Pedersen, and myself, we took the image patches in the top 20% for contrast, then subtracted their mean and divided by their standard deviation, obtaining data points on a seven-dimensional sphere. In this sphere, there is a surface representing the responses to image patches produced by imaging straight edges with various orientations and offsets. Close analysis shows that the data is highly concentrated near this surface, with asymptotically infinite density along the surface itself. Malik and Shi take small patches and analyze these by a filter bank of 36 wavelet filters. They then apply k-means clustering to find high-density points in this point cloud. Again the centers of these clusters resemble the traditional textons and primitives. In addition, they can adapt the set of textons they derive to individual images, obtaining a powerful tool for representing a single image.
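The patch-clustering computation is simple enough to sketch in full (ours; a plain k-means on mean-subtracted, contrast-normalized 3 × 3 patches of a synthetic edge image, standing in for a natural-image database):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic "image": random step edges plus a little noise.
img = np.cumsum(rng.normal(size=(100, 100)) * (rng.random((100, 100)) < 0.05),
                axis=1)
img += 0.05 * rng.normal(size=img.shape)

# Collect 3 x 3 patches, keep the highest-contrast fifth, normalize them
# to points on a seven-dimensional sphere (mean 0, norm 1 in R^9).
ps = np.array([img[i:i+3, j:j+3].ravel()
               for i in range(0, 97, 2) for j in range(0, 97, 2)])
ps = ps - ps.mean(axis=1, keepdims=True)
norms = np.linalg.norm(ps, axis=1)
keep = ps[norms > np.quantile(norms, 0.8)]
keep /= np.linalg.norm(keep, axis=1, keepdims=True)

# Plain k-means: the cluster centers play the role of "textons".
k = 6
centers = keep[rng.choice(len(keep), k, replace=False)]
for _ in range(20):
    labels = ((keep[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    centers = np.array([keep[labels == c].mean(0) if (labels == c).any()
                        else centers[c] for c in range(k)])

print(centers.shape)   # (6, 9): six 3x3 texton prototypes
```

On this toy input the centers come out as small step-edge templates; on natural patches the studies cited above additionally find blobs, bars, and occasional terminators.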
A definitive analysis of images, deriving directly the correct vocabulary of basic image constituents, has not been made, but the outlines of the answer are now clear.

2.2  Grouping of Image Structures

In the analysis of signals of any kind, the most basic "hidden variables" are the labels for parts of the signal that should be grouped together, either because they are homogeneous parts in some sense or because the components of this part occur together with high frequency. This grouping process in speech leads to words and in language leads to the elements of grammar: phrases, clauses, and sentences. On the most basic statistical level, it seeks to group parts of the signal whose probability of occurring together is significantly greater than it would be if they were independent; see section 2.2.3 for this formalism. The factors causing grouping were the central object of study for the Gestalt school of psychology. This school flourished in Germany and later in Italy in the first half of the twentieth century and included M. Wertheimer, K. Koffka, W. Metzger, E. Brunswik, G. Kanizsa, and many others. Their catalog of features which promoted grouping included



Figure 2.5  Textons derived by k-means clustering applied to 8 × 8 image patches. On the top, Huang's results for image patches from van Hateren's database; on the bottom, Malik et al.'s results using single images and filter banks. Note the occasional terminators in Huang's results, as Julesz predicted.


color and proximity, alignment, parallelism and symmetry, closedness and convexity. Kanizsa was well aware of the analogy with linguistic grammar, titling his last book Grammatica del Vedere (Kanizsa, 1980). But they had no quantitative measures for the strength of these grouping principles, as they well knew. This is similar to the situation for traditional theories of human language grammar: a good story to explain what words are to be grouped together in phrases, but no numbers. The challenge we now face is to create theories of stochastic grammars which can express why one grouping is chosen in preference to another. It is a striking fact that, faced either with a sentence or a scene of the world, human observers choose the same groupings with great consistency. This is in contrast with computers which, given only the grouping rules, find thousands of strange parses of both sentences and images.

2.2.1  The Most Basic Grouping: Segmentation and Texture

The simplest grouping rules are those of similar color (or brightness) and proximity. These two rules have been used to attack the segmentation problem. The most naive but direct approach to image segmentation is based on the assumption that images break up into regions on which their intensity values are relatively constant and across whose boundaries those values change discontinuously. A mathematical version of this approach, which gives an explicit measure for comparing diﬀerent proposed segmentations, is the energy functional proposed by Shah and myself (Mumford and Shah, 1989). It is based on a model I = u + v where u is a simpliﬁed cartoon of the image and v is “noise”:

E(I, u, Γ) = C_1 ∫∫_D (I − u)^2 + C_2 ∫∫_{D−Γ} ‖∇u‖^2 + C_3 · length(Γ),   where

D = domain of I, Γ = boundaries of regions which are grouped together, and C_i = parameters to be learned. In this model, pixels in D − Γ have been grouped together by stringing together pairs of nearby similarly colored pixels. Different segmentations correspond to choosing different u and Γ, and the one with lower energy is preferred. Using the Gibbs statistical mechanics approach, this energy can be thought of as a probability: heuristically, we set p(I, u, Γ) = e^{−E(I,u,Γ)/T}/Z, where T and Z are constants. Taking this point of view, the first term in E is equivalent to assuming v = I − u is a sample from white noise. Moreover, if Γ is fixed, then the second term in E makes u a sample from the scale-invariant Gaussian distribution on functions, suitably adapted to the smaller domain D − Γ. It is hard to interpret the third term even heuristically, although Brownian motion (x(t), y(t)) is heuristically a sample

from the prior e^{−∫(x′(t)^2 + y′(t)^2) dt}, which, if we adopt arc length parameterization, becomes e^{−length(Γ)}. If we stay in the discrete pixel setting, the Gibbs model corresponding to E makes good mathematical sense; it is a variant of the Ising model of statistical mechanics (Blake and Zisserman, 1987; Geman and Geman, 1984). The most obvious weakness in this model is its failure to group similarly textured regions together. Textural segmentation is an example of the hierarchical application of gestalt rules: first the individual textons are grouped by having similar colors, orientations, lengths, and aspect ratios. Then these groupings of textons are further grouped into extended textured regions with homogeneous or slowly varying "texture". Ad hoc adaptations of the above energy approach to textural grouping (Geman and Graffigne, 1986; Hofmann et al., 1998; Lee et al., 1992) have been based on choosing some filter bank, the similarity of whose responses is taken as a surrogate for the first low-level texton grouping. One of the problems of this approach is that textures are often characterized not so much by an average of all filter responses as by the very large response of one particular filter, especially by the outliers occurring when this filter precisely matches a texton (Zhu et al., 1997). A careful and very illuminating statistical analysis of the importance of color, textural, and edge features on grouping, based on human-segmented images, was given by Malik's group (Foulkes et al., 2003).
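A toy discrete version of the energy makes the comparison of segmentations concrete (our sketch; a 1D signal and unit constants C_i chosen only for illustration): for a noisy step signal, placing a boundary at the true jump yields lower E than using no boundary at all.

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy 1D step signal: two regions with different means.
I = np.concatenate([np.zeros(50), 2 * np.ones(50)]) + 0.1 * rng.normal(size=100)

C1, C2, C3 = 1.0, 1.0, 1.0   # illustrative constants, not learned values

def energy(I, u, gamma):
    """Discrete 1D Mumford-Shah energy: fidelity + smoothness of u away
    from the boundary set gamma + a penalty per boundary point."""
    fidelity = C1 * ((I - u) ** 2).sum()
    grads = np.diff(u) ** 2
    grads[list(gamma)] = 0.0             # no smoothness cost across boundaries
    return fidelity + C2 * grads.sum() + C3 * len(gamma)

# Candidate 1: boundary at the true jump, u = per-region means.
u1 = np.concatenate([np.full(50, I[:50].mean()), np.full(50, I[50:].mean())])
e1 = energy(I, u1, {49})                 # diff index 49 = jump between 49 and 50

# Candidate 2: no boundary, u = one global mean.
u2 = np.full(100, I.mean())
e2 = energy(I, u2, set())

print(e1 < e2)    # True: the segmented cartoon has lower energy
```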

2.2.2  Extended Lines and Occlusion

The most striking demonstrations of gestalt laws of grouping come from occlusion phenomena, when edges disappear behind an object and reappear. A typical example is shown in ﬁg 2.6. The most famous example is the so-called Kanizsa triangle, where, to further complicate matters, the foreground triangle has the same color as the background with only black circles of intermediate depth being visible. The grouping laws lead one to infer the presence of the occluding triangle and the completion of the three partially occluded black circles. An amusing variant, the Kanizsa pear, is shown in the same ﬁgure. These eﬀects are not merely psychophysical curiosities. Virtually every image of the natural world has major edges which are occluded one or more times by foreground objects. Correctly grouping these edges goes a long way to ﬁnding the correct parse of an image. A good deal of modeling has gone into the grouping of disconnected edges into extended edges and the evaluation of competing groupings by energy values or probabilities. Pioneering work was done by Parent and Zucker (1989) and Shashua and Ullman (1988). Nitzberg, Shiota, and I proposed a model for this (Nitzberg et al., 1992) which was a small extension of the Mumford-Shah model. The new energy involves explicitly the overlapping regions Rα in the image given by the 3D objects in the scene, both the visible and the occluded parts of these objects. Therefore, ﬁnding its minimum involves inferring the occluded parts of the visible objects as well as the boundaries of their visible parts. (These are literally “hidden


Figure 2.6  Two examples of gestalt grouping laws: on the left, the black bars are continued under the white blob to form the letter T; on the right, the semicircles are continued underneath a foreground "pear" which must be completed by contours with zero contrast.

variables”.) Moreover, we need the depth order of the objects—which are nearer, which farther away. The cartoon u of the image is now assumed piecewise constant, with value uα on the region Rα . Then,

E(I, {u_α}, {R_α}) = Σ_α [ C_1 ∫∫_{R′_α} (I − u_α)^2 + C_2 ∫_{∂R_α} κ_{∂R_α}^2 ds + C_3 ∫_{∂R_α} ds ],

where

R′_α = R_α − ⋃_{nearer R_β} (R_α ∩ R_β) = visible part of R_α,
κ_{∂R_α} = curvature of ∂R_α.

This energy allows one to quantify the application of gestalt rules for inferring occluded objects and predicts correctly, for example, the objects present in the Kanizsa triangle. The minima of this E will infer specific types of hidden contours, namely contours which come from the purely geometric variational problem of minimizing a sum of squared curvature and arc length along an unknown curve. This variational problem was first formulated by Euler, who called the resulting curves elastica. To make a stochastic model out of this, we need a stochastic model for the edges occurring in natural images. There are two parts to this: one is modeling the local nature of edges in images and the other is modeling the way they group into extended curves. Several very simple ideas for modeling curves locally, based on Brownian motion, were proposed in Mumford (1992). Brownian paths themselves are too jagged to be suitable, but one can assume the curves are C^1 and that their orientation θ(s), as a function of arc length, is Brownian. Geometrically, this is like saying their curvature is white noise. Another alternative is to take 2D projections of 3D curves whose direction of motion, given by a map from arc length to points


on the unit sphere, is Brownian. Such curves have more corners and cusps, where the 3D path heads toward or away from the camera. Yet another option is to generate parameterized curves whose velocity (x′(t), y′(t)) is given by two Ornstein-Uhlenbeck processes (Brownian functions with a restoring force pulling them to 0). These paths have nearly straight segments when the velocity happens to get large. A key probability distribution in any such theory is p(x, y, θ), the probability density that if an image contour passes through (0, 0) with horizontal tangent, then this contour will also pass through (x, y) with orientation θ. This function has been estimated from image databases in (Geisler et al., 2001), but I do not know of any comparison of their results with mathematical models. Subsequently, Zhu (1999) and Ren and Malik (2002) directly analyzed edges and their curvature in hand-segmented images. Zhu found a high-kurtosis empirical distribution much like filter responses: a peak at 0 showing the prevalence of straight edges and large tails indicating the prevalence of corners. He built a stochastic model for polygonal approximations to these curves using an exponential model of the form

p(Γ) ∝ e^{−∫_Γ [ψ_1(κ(s)) + ψ_2(κ′(s))] ds},

where κ is the curvature of Γ and the ψ_i are unknown functions chosen so that the model yields the same distribution of κ, κ′ as that found in the data. Finding continuum limits of his models under weak convergence is an unsolved problem. Ren and Malik's models go beyond the previous strictly local ones. They are kth-order Markov models in which the orientation θ_{k+1} of a curve at a sample point P_{k+1} is a sample from a joint probability distribution of the orientations θ_k^α of both the curve and smoothed versions of itself at other scales α, all at the previous point P_k. A completely different issue is finding probabilities that two edges should be joined, e.g., if Γ_1, Γ_2 are two curves ending at points P_1, P_2, how likely is it that in the real world there is a curve Γ_h joining P_1 and P_2 and creating a single curve Γ_1 ∪ Γ_h ∪ Γ_2? This link might be hidden in the image because of occlusion, noise, or low contrast (anyone with experience with real images will not be surprised at how often this happens). Jacobs, Williams, Geiger, and others have developed algorithms of this sort based on elastica and related ideas (Geiger et al., 1998; Williams and Jacobs, 1997). Elder and Goldberg (2002) and Geisler et al. (2001) have carried out psychophysical experiments to determine the effects of proximity, orientation difference, and edge contrast on human judgments of edge completions. One of the subtle points here (as Ren and Malik make explicit) is that this probability does not depend only on the endpoints P_i and the tangent lines to the Γ_i at these points. So, for instance, if Γ_1 is straight for a certain distance before its endpoint P_1, then the longer this straight segment is, the more likely it is that any continuation it has will also be straight. An elegant analysis of the situation purely for straight edges has been given by Desolneux et al. (2003). It is based on what they call maximally meaningful alignments, which come from computing

2.2

Grouping of Image Structures

39

Figure 2.7 An experiment finding the prostate in an MRI scan (from August (2002)). On the left, the raw scan; in the middle, edge filter responses; on the right, the computed posterior of August's curve indicator random field (which actually lives in (x, y, θ) space, hence the boundary of the prostate is actually separated from the background noise).

the probabilities of accidental alignments and no other prior assumptions. The most compelling analysis of the problem, to my mind, is that in the thesis of Jonas August (August, 2001). He starts with a prior on a countable set of true curves, assumed to be part of the image. Then he assumes a noisy version of this is observed and seeks the maximally probable reconstruction of the whole set of true curves. An example of his algorithms is shown in fig 2.7. Another algorithm for global completion of all image contours has been given recently by Malik's group (Ren et al., 2005).

2.2.3

Mathematical Formalisms for Visual Grammars

The "higher-level" Gestalt rules for grouping based on parallelism, symmetry, closedness, and convexity are even harder to make precise. In this section, I want to describe a general approach to these questions. So far, we have described grammars loosely as recursive groupings of parts of a signal, where the signal can be a string of phonemes or an image of pixels. The mathematical structure which these groupings define is a tree: each subset of the domain of the image which is grouped together defines a node in this tree and, whenever one such group contains another, we join the nodes by an edge. In the case of sentences in human languages, this tree is called the parse tree. In the case of images, it is similar to the image pyramid made up of the pixels of the image plus successively "blown-down" images 2ⁿ times smaller. However, unlike the image pyramid, its nodes stand only for natural groupings, so its structure is adaptively determined by the image itself. To go deeper into the formalism of grammar, the next step is to label these groupings. In language, typical labels are "noun phrase," "prepositional clause," etc. In images, labels might be "edgelet," "extended edge," "ribbon," "T-junction," or

40

Empirical Statistics and Stochastic Models for Visual Signals

Figure 2.8 The parse tree for the letter A, which labels the top node; the lower nodes might be labeled "edge" and "corner." Note that in grouping the two sides, each edge has an attribute giving its length, and approximate equality of the lengths of the sides must hold; in the final grouping, the bar of the A must meet the two sides at approximately equal angles. These are probabilistic constraints involving specific attributes of the constituents, which must be included in B.

even "the letter A." Then the grouping laws are usually formulated as productions:

noun phrase −→ determiner + noun
extended edge −→ edgelet + extended edge

probabilistic context-free grammar

where the group is on the left and its constituents are shown on the right. The second rule creates a long edge by adding a small piece, an edgelet, to one end. But now the issue of agreement surfaces: one can say “a book” and “some books” but not “a books” or “some book.” The determiner and the noun must agree in number. Likewise, to group an edge with a new edgelet requires that the edgelet connect properly to the edge: where one ends, the other must begin. So we need to endow our labeled groupings with a list of attributes that must agree for the grouping to be possible. So long as we can do this, we have created a context-free grammar. Context-freeness means that the possibility of the larger grouping depends only on the labels and attributes of the constituents and nothing else. An example of the parse of the letter A is shown in ﬁg 2.8. We make the above into a probability model in a top-down generative fashion by assigning probabilities to each production. For any given label and attributes, the sum (or integral) of the probabilities of all possible productions it can yield should be 1. This is called a PCFG (probabilistic context-free grammar) by linguists. It is the same as what probabilists call a random branching tree (except that grammars are usually assumed to almost surely yield ﬁnite parse trees). A more general formalism for deﬁning random trees with random data attached to their nodes has been given by Artur Fridman (Fridman, 2003). He calls his models mixed Markov models because some of the nodes carry address variables whose value is the index of another node. Thus in each sample from the model, this node adds a new edge to the graph. His models include PCFGs as a special case.
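The top-down generative process described above can be sampled directly: repeatedly choose a production for each label according to its probabilities. The toy grammar below (labels and probabilities are invented for this illustration) implements the "extended edge −→ edgelet + extended edge" rule; because the expected number of nonterminal children per node is below 1, the sampled parse trees are almost surely finite, as the text notes.

```python
import random

# Toy PCFG: each label maps to a list of (probability, children) productions.
# A child with no entry in the grammar is a terminal. For each label, the
# production probabilities must sum to 1.
GRAMMAR = {
    "extended-edge": [
        (0.4, ["edgelet"]),                   # stop growing
        (0.6, ["edgelet", "extended-edge"]),  # append an edgelet, keep growing
    ],
}

def sample_parse_tree(label, rng):
    """Sample a parse tree top-down; returns (label, children) pairs."""
    if label not in GRAMMAR:
        return (label, [])                    # terminal node
    r, acc = rng.random(), 0.0
    for p, children in GRAMMAR[label]:
        acc += p
        if r <= acc:
            return (label, [sample_parse_tree(c, rng) for c in children])
    # numerical safety net: fall back to the last production
    return (label, [sample_parse_tree(c, rng) for c in GRAMMAR[label][-1][1]])

def leaves(tree):
    label, children = tree
    return [label] if not children else [l for c in children for l in leaves(c)]

rng = random.Random(0)
tree = sample_parse_tree("extended-edge", rng)
print(leaves(tree))  # a random-length run of "edgelet" terminals
```

With the stopping probability raised above 0.5, growth would be supercritical and the tree would have positive probability of being infinite, which is why PCFGs are usually assumed subcritical.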


Figure 2.9 A simplification of the parse tree inferred by the segmentation algorithm of Galun et al. (2003). The image is at the bottom and part of its tree is shown above it. On the right are shown some of the regions in the image, grouped by successive levels of the algorithm.

Random trees can be ﬁt naturally into the random wavelet model (or the dead leaves model) described above. To see this, we consider each 4-tuple {xα , yα , rα , ψα } in the model not merely as generating one elementary constituent of the image, but as the root of a whole random branching tree. The child nodes it generates should add parts to a now compound object, expanding the original simple image constituent ψα . For example the root might be an elongated blob representing the trunk of a person and the tree it generates would add the limbs, clothes, face, hands, etc., to the person. Or the root might be a uniform patch and the tree would add a whole set of textons to it, making it into a textured patch. So long as the rate of growth of the random branching tree is not too high, we still get a scale-invariant model. Two groups have implemented image analysis programs based on computing such trees. One is the multiscale segmentation algorithm of Galun, Sharon, Basri, and Brandt (Galun et al., 2003), which produces very impressive segmentation results. The method follows Brandt’s adaptive tree-growing algorithm called algebraic multi-grid. In their code, texture and its component textons play the same role as objects and their component parts: each component is identiﬁed at its natural scale and grouped further at a higher level in a similar way (see ﬁg 2.9). Their code is fully scale-invariant except at the lowest pixel level. It would be very interesting to ﬁt their scheme into the Bayesian framework. The other algorithm is an integrated bottom-up and top-down image parsing


program from Zhu's lab (Tu et al., 2003). The output of their code is a tree with semantically labeled objects at the top, followed by parts and texture patches in the middle, with the pixels at the bottom. This program is based on a full stochastic model. A basic problem with this formalism is that it is not sufficiently expressive: the grammars of nature appear to be context sensitive. This is often illustrated by contrasting languages that have sentences of the form abcddcba, which can be generated recursively by a small set of productions as in s → asa → absba → abcscba → abcddcba, versus languages which have sentences of the form abcdabcd, with two complex repeating structures, which cannot be generated by simple productions. Obviously, images with two identical faces are analogs of this last sentence. Establishing symmetry requires you to reopen the grouped package and examine everything in it to see if it is repeated! Unless you imagine each label given a huge number of attributes, this cannot be done in a context-free setting. In general, two-dimensional geometry creates complex interactions between groupings, and the strength of higher-order groupings seems to always depend on multiple aspects of each piece. Take the example of a square. Ingredients of the square are (1) the two groupings of parallel edges, each made up of a pair of parallel sides of equal length, and (2) the grouping of edgelets adjacent to each vertex into a "right-angle" group. The point is that the pixels involved in these smaller groupings partially intersect. In PCFGs, each group should expand to disjoint sets of primitives or to one set contained in another. The case of the square is best described with the idea of graph unification, in which a grouping rule unifies parts of the graph of parts under each constituent. S. Geman and his collaborators (Bienenstock et al., 1998; Geman et al., 2002) have proposed a general framework for developing such probabilistic context-sensitive grammars.
He proposes that for a grouping rule in which groups y1, y2, · · · , yk are to be unified into a larger group x, there is a binding function B(y1, y2, · · · , yk) which singles out those attributes of the constituents that affect the probability of making the k-tuple of y's into an x. For example, to put two edgelets together, we need to ask if the endpoint of the first is near the beginning of the second and whether their directions are close. The closer these points and directions are, the more likely it is that the two edgelets should be grouped. The basic hypothesis is that the likelihood ratio p(x, y1, · · · , yk)/∏i p(yi) depends only on B(y1, · · · , yk). In their theory, Geman and colleagues analyze how to compute this function from data. This general framework needs to be investigated in many examples to further constrain it. An interesting example is the recent work of Ullman and collaborators (Ullman et al., 2002) on face recognition, built up through the recognition of parts: this would seem to fit into this framework. But, overall, the absence of mathematical theories which incorporate all the Gestalt rules at once seems to me the biggest gap in our understanding of images.
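To make the binding-function idea concrete, here is a sketch for the two-edgelet case: B extracts just the endpoint gap and the orientation difference, and the likelihood ratio is modeled as a function of these alone. The exponential form and the scale parameters are placeholders of my own invention; in the framework just described, this function would be estimated from data.

```python
import math

def binding_features(e1, e2):
    """B(y1, y2) for two edgelets, each given as ((x0, y0), (x1, y1)).
    Returns the endpoint gap and the orientation difference -- the only
    attributes assumed here to affect the grouping probability."""
    (ax0, ay0), (ax1, ay1) = e1
    (bx0, by0), (bx1, by1) = e2
    gap = math.hypot(bx0 - ax1, by0 - ay1)   # end of e1 to start of e2
    th1 = math.atan2(ay1 - ay0, ax1 - ax0)
    th2 = math.atan2(by1 - by0, bx1 - bx0)
    dth = abs((th2 - th1 + math.pi) % (2 * math.pi) - math.pi)
    return gap, dth

def grouping_likelihood_ratio(e1, e2, gap_scale=2.0, angle_scale=0.5):
    """Hypothetical model: p(x, y1, y2) / (p(y1) p(y2)) depends only on B.
    The exponential falloff in gap and angle is a placeholder, not the
    data-estimated function of the actual framework."""
    gap, dth = binding_features(e1, e2)
    return math.exp(-gap / gap_scale - dth / angle_scale)

collinear = (((0, 0), (1, 0)), ((1.1, 0), (2, 0)))
bent      = (((0, 0), (1, 0)), ((1.1, 0.5), (1.1, 2)))
assert grouping_likelihood_ratio(*collinear) > grouping_likelihood_ratio(*bent)
```

The assertion reflects the qualitative claim in the text: nearby, nearly collinear edgelets should be far more likely to be grouped than distant or sharply bent ones.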

2.3

Probability Measures on the Space of Shapes

The most characteristic new patterns found in visual signals, but not in one-dimensional signals, are shapes: two-dimensional regions in the domain of the image. In auditory signals, one has intervals on which the sound has a particular spectrum, for instance, corresponding to some specific type of source (for phonemes, some specific configuration of the mouth, lips, and tongue). But an interval is nothing but a beginning point and an endpoint. In contrast, a subset of a two-dimensional region is much more interesting and conveys information by itself. Thus people often recognize objects by their shape alone and have a rich vocabulary of different categories of shapes, often based on prototypes (heart-shaped, egg-shaped, star-shaped, etc.). In creating stochastic models for images, we must face the issue of constructing probability measures on the space of all possible shapes. An even more basic problem is to construct metrics on the space of shapes: measures for the dissimilarity of two shapes. It is striking how people find it quite natural to be asked if some new object has a shape similar to some old object or category of objects. They act as though they carried a clear-cut psychophysical metric in their heads, although, when tested, their similarity judgments show a huge amount of context sensitivity.

2.3.1

manifold

The Space of Shapes and Some Basic Metrics on It

What do we mean by the space of shapes? The idea is simply to define this space as the set of 2-dimensional shapes, where a shape is taken to mean an open subset S ⊂ R² with smooth boundary7. We let S denote this set of shapes. The mathematician's approach is to ask: what structure can we give to S to endow it with a geometry? In particular, we want to define (1) local coordinates on S, so that it is a manifold, (2) a metric on S, and (3) probability measures on S. Having probability measures will allow us to put shapes into our theory as hidden variables and extend the Bayesian inference machinery to include inferring shape variables from images. S itself is not a vector space: one cannot add and subtract two shapes in a way satisfying the usual laws of vectors. Put another way, there is no obvious way to put global coordinates on S, that is, to create a bijection between points of S and points in some vector space. One can, e.g., describe shapes by their Fourier coefficients, but the Fourier coefficients coming from shapes will be very special sequences of numbers. What we can do, however, is put a local linear structure on the space of shapes. This is illustrated in fig 2.10. Starting from one shape S, we erect normal lines at each point of the boundary Γ of S. Then nearby shapes will have boundaries which intersect each normal line in a unique point. Suppose ψ(s) ∈ R² is an arc-length parameterization of Γ. Then the unit normal vector is given by n(s) = ψ′(s)⊥ and each nearby curve is parameterized uniquely in the form ψa(s) = ψ(s) + a(s) · n(s),

for some function a(s).
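Numerically, this chart construction is straightforward. The sketch below uses the unit circle as the base shape Γ (my choice for the example, since its arc-length parameterization and normals are explicit) and represents a nearby curve by a normal displacement a(s):

```python
import numpy as np

# A chart near the unit circle: nearby shapes are parameterized by a
# normal displacement a(s) along the base curve psi(s).
n_pts = 200
s = np.linspace(0, 2 * np.pi, n_pts, endpoint=False)
psi = np.stack([np.cos(s), np.sin(s)], axis=1)       # arc-length param. of circle
tangent = np.stack([-np.sin(s), np.cos(s)], axis=1)
normal = np.stack([tangent[:, 1], -tangent[:, 0]], axis=1)  # rotate tangent by -90 deg

# sanity check: for the unit circle the outward unit normal is psi itself
assert np.allclose(normal, psi)

# a nearby shape in this chart: a small two-lobed bump
a = 0.1 * np.cos(2 * s)
psi_a = psi + a[:, None] * normal
```

Since psi_a = (1 + a(s))·psi here, the deformed curve has radius 1 + a(s); any sufficiently small smooth a(s) gives a valid nearby shape, which is exactly the bijection the chart provides.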


Figure 2.10 The manifold structure on the space of shapes is here illustrated: all curves near the heavy one meet the normal “hairs” in a unique point, hence are described by a function, namely, how far this point has been displaced normally.

tangent space

All smooth functions a(s) which are sufficiently small can be used, so we have created a bijection between an open set of functions a, that is, an open set in a vector space, and a neighborhood of Γ ∈ S. These bijections are called charts, and on overlaps of such charts one can convert the a's used to describe the curves in one chart into the functions in the other chart: this means we have a manifold. For details, see the paper (Michor and Mumford, 2006). Of course, the function a(s) lies in an infinite-dimensional vector space, so S is an infinite-dimensional manifold. But that is no deterrent to its having its own intrinsic geometry. Being a manifold means S has a tangent space at each point S ∈ S. This tangent space consists in the infinitesimal deformations of S, i.e., those coming from infinitesimal a(s). The infinitesimal deformations may thus be thought of simply as normal vector fields to Γ, that is, the vector fields a(s) · n(s). We denote this tangent space as TS,S. How about metrics? In analysis, there are many metrics on spaces of functions, and they vary in two different ways. One choice is whether you make a worst-case analysis or an average analysis of the difference of two functions—or something in between. This means you define the difference of two functions a and b either as sup_x |a(x) − b(x)|, as the integral ∫ |a(x) − b(x)| dx, or as an Lp norm, (∫ |a(x) − b(x)|^p dx)^{1/p} (which is in between). The case p = ∞ corresponds to the sup, and p = 1 to the average. Usually, the three important cases8 are p = 1, 2, or ∞. The other choice is whether to include derivatives of a, b as well as the values of a, b in the formula for the distance and, if so, up to what order k. These distinctions carry


Figure 2.11 Each of the shapes A, B, C, D, and E is similar to the central shape, but in different ways. Different metrics on the space of shapes bring out these distinctions (adapted from B. Kimia).

over to shapes. The best-known metrics are the so-called Hausdorff metric,

d∞,0(S, T) = max( sup_{x∈S} inf_{y∈T} ‖x − y‖, sup_{y∈T} inf_{x∈S} ‖x − y‖ ),

for which p = ∞, k = 0, and the area metric,

d1,0(S, T) = Area((S − S ∩ T) ∪ (T − S ∩ T)),

for which p = 1, k = 0. It is important to realize that there is no one right metric on S. Depending on the application, different metrics are good. This is illustrated in fig 2.11. The central bow-tie-like shape is similar to all the shapes around it. But different metrics bring out their dissimilarities and similarities in each case. The Hausdorff metric applied to the outsides of the shapes makes A far from the central shape; any metric using the first derivative (i.e., the orientation of the tangent lines to the boundary) makes B far from the central shape; a sup-type metric with the second derivative (i.e., the curvature of the boundary) makes C far from the central shape, as curvature becomes infinite at corners; D is far from the central shape in the area metric; E is far in all metrics, but the challenge is to find a metric in which it is close to the central shape. E has "outliers," the spikes, but is identical to the central shape if they can be ignored. To do this needs what are called robust metrics, of which the simplest example is L^{1/2} (not a true metric at all).

2.3.2

Riemannian metrics

Riemannian Metrics and Probability Measures via Diﬀusion

There are great mathematical advantages to using L², so-called Riemannian metrics. More precisely, a Riemannian metric is given by defining an inner product (a positive definite quadratic form) on the tangent space TS,S. In Riemannian settings, the unit balls are nice


and round, and extremal problems, such as paths of shortest length, are usually well posed. This means we can expect to have geodesics, optimal deformations of one shape S to a second shape T through a family St of intermediate shapes, i.e., we can morph S to T in a most efficient way. Having geodesics, we can study the geometry of S, for instance whether its geodesics diverge or converge9 —which depends on the curvature of S in the metric. But most important of all, we can define diffusion and use this to get Brownian paths and thus probability measures on S. A most surprising situation arises here: there are three completely different ways to define Riemannian metrics on S. We need to assign a norm to normal vector fields a(s)n(s) along a simple closed plane curve Γ.

local metric

In the first, an infinitesimal metric, the norm is defined as an integral along Γ. In general, this can be any expression

‖a‖² = ∫Γ F(a(s), a′(s), a″(s), · · · , κ(s), κ′(s), · · · ) ds,

involving a function F quadratic in a and the derivatives of a, whose coefficients can possibly be functions associated to Γ, like the curvature and its derivatives. We call these local metrics. We might have F = a(s)², or F = (1 + Aκ(s)²) · a(s)², where A is a constant; or F = a(s)² + A a′(s)², etc. These metrics have been studied by Michor and Mumford (Michor and Mumford, 2006, 2005). Globally, the distance between two shapes is then

d(S0, S1) = inf over paths {St} of ∫₀¹ ‖∂St/∂t‖ dt,

where ∂St/∂t is the normal vector field given by this path.

diffeomorphism

In other situations, a morph of one shape to another needs to be considered as part of a morph of the whole plane. For this, the metric should be a quotient of a metric on the group G of diﬀeomorphisms of R2 , with some boundary condition, e.g., equal to the identity outside some large region. But an inﬁnitesimal diﬀeomorphism is just a vector ﬁeld v on R2 and the induced inﬁnitesimal deformation of Γ is given by a(s) = (v ·n(s)). Let V be the vector space of all vector ﬁelds on R2 , zero outside some large region. Then this means that the norm on a is

‖a‖² = inf over v ∈ V with (v · n) = a of ∫R² F(v, vx, vy, · · · ) dx dy,

Miller's metric

where we define an inner product on V using a symmetric positive definite quadratic expression in v and its partial derivatives. We might have F = ‖v‖² or F = ‖v‖² + A‖vx‖² + A‖vy‖², etc. It is convenient to use integration by parts and write all such F's as (Lv, v), where L is a positive definite partial differential operator (L = I − AΔ in the second case above). These metrics have been studied by Miller, Younes, and their many collaborators (Miller, 2002; Miller and Younes, 2001) and applied extensively to the subject they call computational anatomy, that is, the analysis of medical scans by deforming them to template anatomies. Globally, the


Figure 2.12 A diffusion on the space of shapes in the Riemannian metric of Miller et al. The shapes should be imagined on top of each other, the translation to the right being added so that each shape can be seen clearly. The diffusion starts at the unit circle.

distance between two shapes is then

dMiller(S, T) = inf over paths φ of ∫₀¹ ( ∫R² F((∂φ/∂t) ◦ φ⁻¹) dx dy )^{1/2} dt,

where φ(t), 0 ≤ t ≤ 1, is a path in G with φ(0) = I and φ(1)(S) = T.

Weil-Petersson metric

Finally, there is a remarkable and very special metric on S̄ = S modulo translations and scalings (i.e., one identifies any two shapes which differ by a translation plus a scaling). It is derived from complex analysis and known as the Weil-Petersson (or WP) metric. Its importance is that it makes S̄ into a homogeneous metric space, that is, it has everywhere the same geometry. There is a group of global maps of S̄ to itself which preserve distances in this metric and which can take any shape S to any other shape T. This is not the case with the previous metrics, hence the WP metric emerges as the analog of the standard Euclidean distance in finite dimensions. The definition is more elaborate and we do not give it here; see the paper (Mumford and Sharon, 2004). This metric also has negative or zero curvature in all directions, and hence finite sets of shapes as well as probability measures on S̄ should always have a well-defined mean (minimizing the sum of squares of distances) in this metric. Finally, this metric is closely related to the medial axis, which has been frequently used for shape classification. The next step in each of these theories is to investigate the heat kernel, the solution of the heat equation starting at a delta function. This important question has not been studied yet. But diffusions in these metrics are easy to simulate. In fig 2.12 we show three random walks in S in one of Miller's metrics. The analogs of Gaussian distributions are the probability measures gotten by stopping the diffusion at a specific point in time. And analogs of the scale mixtures of Gaussians discussed above are obtained by using a so-called random stopping time, that is, choosing the time to halt the diffusion randomly from another probability distribution. It seems clear that one or more of these diffusion measures are natural general-purpose priors on the space of shapes.
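Though the exact heat kernels are unknown, a crude random walk on shape space is indeed easy to simulate: start at the unit circle and repeatedly displace the curve along its normals by a small random field. The sketch below is only a cartoon of the diffusions just described; band-limiting the noise to a few Fourier modes stands in for a metric that penalizes rough deformations, and none of the constants come from the text.

```python
import numpy as np

def shape_random_walk(n_pts=100, n_steps=50, step=0.01, n_modes=6, seed=0):
    """Crude random walk on shapes: start at the unit circle and repeatedly
    displace the curve along its normals by a random band-limited field.
    Illustrative only -- not Miller's actual diffusion construction."""
    rng = np.random.default_rng(seed)
    s = np.linspace(0, 2 * np.pi, n_pts, endpoint=False)
    curve = np.stack([np.cos(s), np.sin(s)], axis=1)
    for _ in range(n_steps):
        # outward normals by rotating the (periodic) tangent by -90 degrees
        tang = np.roll(curve, -1, axis=0) - np.roll(curve, 1, axis=0)
        tang /= np.linalg.norm(tang, axis=1, keepdims=True)
        normal = np.stack([tang[:, 1], -tang[:, 0]], axis=1)
        # random smooth normal field built from a few low-frequency modes
        a = sum(rng.normal() * np.cos(k * s) + rng.normal() * np.sin(k * s)
                for k in range(1, n_modes + 1))
        curve = curve + step * a[:, None] * normal
    return curve

final = shape_random_walk()
```

Stopping after a fixed number of steps gives one sample from the "Gaussian-like" measure at that time; drawing the stopping time from another distribution gives the scale-mixture analogs mentioned above.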


2.3.3

Finite Approximations and Some Elementary Probability Measures

A completely different approach is to infer probability measures directly from data. Instead of seeking general-purpose priors for stochastic models, one seeks special-purpose models for specific object-recognition tasks. This has been done by extracting from the data a finite set of landmark points, homologous points which can be found on each sample shape. For example, in 3 dimensions, skulls have long been compared by taking measurements of distances between classical landmark points. In 2 dimensions, assuming these points are on the boundary of the shape, the infinite-dimensional space S is replaced by the finite-dimensional space of the polygons {P1, · · · , Pk} ∈ R^{2k} formed by these landmarks. But, if we start from images, we can allow the landmark points to lie in the interior of the shape also. This approach was introduced a long time ago to study faces. More specifically, it was used by Cootes et al. (1993) and by Hallinan et al. (1999) to fit multidimensional Gaussians to the cloud of points in R^{2k} formed from landmark points on each of a large set of faces. Both groups then apply principal component analysis (PCA) and find the main directions for face variation. However, it seems unlikely to me that Gaussians can give a very good fit. I suspect rather that in geometric situations as well, one will encounter the high-kurtosis phenomenon, with geometric features often near zero but, more often than for Gaussian variables, very large too. A first attempt to quantify this point of view was made by Zhu (1999). He took a database of silhouettes of four-legged animals and computed landmark points, medial axis, and curvature for each silhouette. Then he fit a general exponential model to a set of six scalar variables describing this geometry. The strongest test of whether he has captured some of their essential shape properties is to sample from the model he gets. The results are shown in fig 2.13.
It seems to me that these models are getting much closer to the sort of special-purpose prior that is needed in object-recognition programs. Whether his models have continuum limits and of what sort is an open question. There are really three goals for a theory of shapes adapted to the analysis of images. The ﬁrst is to understand better the global geometry of S and which metrics are appropriate in which vision applications. The second is to create the best general-purpose priors on this space, which can apply to arbitrary shapes. The third is to mold special-purpose priors to all types of shapes which are encountered frequently, to express their speciﬁc variability. Some progress has been made on all three of these but much is left to be done.
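The landmark-plus-PCA construction used by Cootes et al. and Hallinan et al. can be sketched in a few lines: each shape becomes a vector in R^{2k}, and PCA of the resulting cloud yields the main modes of variation. The data here are synthetic rectangles whose width varies (so one mode should dominate), and Procrustes alignment, which a real pipeline would need, is omitted for brevity.

```python
import numpy as np

def landmark_pca(shapes):
    """PCA on a cloud of landmark configurations: each shape is k landmark
    points, flattened to a vector in R^{2k}. Returns the mean shape, the
    principal directions, and the variance along each direction.
    (Procrustes alignment is omitted for brevity.)"""
    X = np.array([s.ravel() for s in shapes])     # shape (n_shapes, 2k)
    mean = X.mean(axis=0)
    Xc = X - mean
    # principal components via SVD of the centered data matrix
    _, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = svals ** 2 / (len(shapes) - 1)
    return mean.reshape(-1, 2), Vt, variances

# Synthetic example: rectangles whose width varies -- one dominant mode.
rng = np.random.default_rng(1)
shapes = []
for _ in range(200):
    w = 1.0 + 0.3 * rng.normal()
    base = np.array([[0, 0], [w, 0], [w, 1.0], [0, 1.0]])
    shapes.append(base + 0.01 * rng.normal(size=base.shape))
mean_shape, modes, variances = landmark_pca(shapes)
assert variances[0] > 10 * variances[1]   # width variation dominates
```

A Gaussian fit in this R^{2k} is exactly what the text questions: PCA recovers the directions of variation, but the distribution along those directions need not be Gaussian, which is the high-kurtosis concern raised above.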

2.4

Summary

Solving the problem of vision requires solving three subproblems: finding the right classes of stochastic models to express accurately the variability of visual patterns in nature, finding ways to learn the details of these models from data, and finding ways to reason rapidly using Bayesian inference on these models. This chapter has


Figure 2.13 Six "animals" that never existed: they are random samples from the prior of S. C. Zhu, trained on real animal silhouettes. The interior lines come from his use of medial axis techniques to generate the shapes.

addressed the ﬁrst. Here a great deal of progress has been made but it must be said that much remains to be done. My own belief is that good theories of groupings are the biggest gap. Although not discussed in this article, let me add that great progress has been made on the second and third problems with a large number of ideas, e.g., the expectation maximization (EM) algorithm, much faster Monte Carlo algorithms, maximum entropy (MaxEnt) methods to ﬁt exponential models, Bayesian belief propagation, particle ﬁltering, and graph-theoretic techniques.

Notes

1. The chief mistake of the Greeks was their persistent belief that the eye must emit some sort of ray in order to do something equivalent to touching the visible surfaces.
2. This is certainly biologically unrealistic. Life requires rapid analysis of changing scenes. But this article, like much of vision research, simplifies its analysis by ignoring time.
3. It is the second idea that helps to explain why aerial photographs also show approximate scale invariance.
4. The infrared divergence is readily solved by considering images mod constants. If the pixel values are the log of the photon energy, this constant is an irrelevant gain factor.
5. Some have found an especially large concentration near 1/f^1.8 or 1/f^1.9, especially for forest scenes (Ruderman and Bialek).
6. Scale invariance implies that its expected power at spatial frequency (ξ, η) is a constant times 1/(ξ² + η²), and integrating this over (ξ, η) gives ∞.
7. A set S of points is open if S contains a small disk of points around each point x ∈ S. Smooth means that the boundary is a curve that is locally the graph of a function with infinitely many derivatives; in many applications, one may want to include shapes with corners. We simplify the discussion here and assume there are no corners.
8. Charpiat et al., however, have used p-norms with p ≫ 1 in order to "tame" L∞ norms.
9. This is a key consideration when seeking means of clusters of finite sets of shapes and in seeking principal components of such clusters.

3

The Machine Cocktail Party Problem

Simon Haykin and Zhe Chen

cocktail party problem

machine cocktail party problem

Imagine you are in a cocktail party environment with background music, and you are participating in a conversation with one or more of your friends. Despite the noisy background, you are able to converse with your friends, switching from one to another with relative ease. Is it possible to build an intelligent machine that is able to perform like yourself in such a noisy environment? This chapter explores such a possibility. The cocktail party problem (CPP), ﬁrst proposed by Colin Cherry, is a psychoacoustic phenomenon that refers to the remarkable human ability to selectively attend to and recognize one source of auditory input in a noisy environment, where the hearing interference is produced by competing speech sounds or various noise sources, all of which are usually assumed to be independent of each other (Cherry, 1953). Following the early pioneering work (Cherry, 1953, 1957, 1961; Cherry and Taylor, 1954), numerous eﬀorts have been dedicated to the CPP in diverse ﬁelds: physiology, neurobiology, psychophysiology, cognitive psychology, biophysics, computer science, and engineering.1 Over half a century after Cherry’s seminal work, however, it is fair to say that a complete understanding of the cocktail party phenomenon is still missing, and the story is far from being complete; the marvelous auditory perception capability of human beings remains enigmatic. To unveil the mystery and thereby imitate human performance by means of a machine, computational neuroscientists, computer scientists, and engineers have attempted to view and simplify this complex perceptual task as a learning problem, for which a tractable computational solution is sought. An important lesson learned from the collective work of all these researchers is that in order to imitate a human’s unbeatable audition capability, a deep understanding of the human auditory system is crucial. 
This does not mean that we must duplicate every aspect of the human auditory system in solving the machine cocktail party problem, hereafter referred to as the machine CPP for short. Rather, the challenge is to expand on what we know about the human auditory system and put it to practical use by exploiting advanced computing and signal-processing technologies (e.g., microphone arrays, parallel computers, and VLSI chips). An eﬃcient and eﬀective solution to the machine CPP will not only be a major accomplishment in its own right, but it will also have a direct impact on ongoing research in artiﬁcial intelligence (such as robotics)


and human-machine interfaces (such as hearing aids); and these lines of research will, in their own individual ways, further deepen our understanding of the human brain.

There are three fundamental questions pertaining to the CPP:

1. What is the cocktail party problem?
2. How does the brain solve it?
3. Is it possible to build a machine capable of solving it in a satisfactory manner?

The first two questions are human oriented, and mainly involve the disciplines of neuroscience, cognitive psychology, and psychoacoustics; the last question is rooted in machine learning, which involves computer science and engineering disciplines. While these three issues are equally important, this chapter will focus on the third question by addressing a solution to the machine CPP. To understand the CPP, we may identify three underlying neural processes:2

Analysis: The analysis process mainly involves segmentation or segregation, which refers to the segmentation of an incoming auditory signal into individual channels or streams. Among the heuristics used by a listener to do the segmentation, spatial location is perhaps the most important. Specifically, sounds coming from the same location are grouped together, while sounds originating from different directions are segregated.

Recognition: The recognition process involves analyzing the statistical structure of essential patterns contained in a sound stream. The goal of this process is to uncover the neurobiological mechanisms through which humans are able to identify a segregated sound from multiple streams with relative ease.

Synthesis: The synthesis process involves the reconstruction of individual sound waveforms from the separated sound streams. While synthesis is an important process carried out in the brain, the synthesis problem is of primary interest to the machine CPP.
From an engineering viewpoint, we may, in a loose sense, regard synthesis as the inverse of the combination of analysis and recognition in that synthesis attempts to uncover relevant attributes of the speech production mechanism. Note also that, insofar as the machine CPP is concerned, an accurate synthesis does not necessarily mean having solved the analysis and recognition problems, although additional information on these two problems might provide more hints for the synthesis process. Bearing in mind that the goal of solving the machine CPP is to build an intelligent machine that can operate eﬃciently and eﬀectively in a noisy cocktail party environment, we propose a computational framework for active audition that has the potential to serve this purpose. To pave the way for describing this framework, we will discuss the important aspects of human auditory scene analysis and computational auditory scene analysis. Before proceeding to do so, however, some historical notes on the CPP are in order.

3.1 Some Historical Notes

In the historical notes that follow, we do two things. First, we present highlights of the pioneering experiments performed by Colin Cherry over half a century ago, which are as valid today as they were then; along the way, we also refer to other related works. Second, we highlight three machine learning approaches that have been motivated by the CPP in one form or another: independent component analysis, oscillatory correlation, and cortronic processing.

3.1.1 Cherry's Early Experiments

In the early 1950s, Cherry became interested in the remarkable hearing capability of human beings in a cocktail party environment. He raised several questions: What is the nature of our selective attention ability? How are we able to select information coming from multiple sources? Some information is retained even when we pay no attention to it; how much? To answer these fundamental questions, Cherry (1953) compared the ability of listeners to attend to two different spoken messages under different scenarios. In one classic experimental setup, the recorded messages were mixed and presented together to the subject over headphones, and the listeners were asked to assess the intelligibility of a message and to repeat each word of the message as it was heard, a task referred to as shadowing. In the dichotic listening setup, one message is delivered to one ear (the attended channel) and a different message is delivered to the other ear (the unattended channel); Cherry reported that listeners can easily attend to one or the other of the two messages, with almost all of the information in the attended message being taken in, while very little about the unattended message is recalled. It was also found that listeners became quite good at the shadowing task after a few minutes, repeating the attended speech quite accurately. However, after a few minutes of shadowing, listeners had no idea of what the unattended voice was about, or even whether English had been spoken. Based on these observations and others, Cherry conjectured that some sort of spatial filtering of the concurrently occurring sounds and voices might be helpful in attending to a particular message.
It is noteworthy that Cherry (1953) also suggested some procedures to design a “ﬁlter” (machine) to solve the CPP, accounting for the following: (1) the voices come from diﬀerent directions; (2) lip reading, gesture, and the like; (3) diﬀerent speaking voices, mean pitches, mean speeds, male vs. female, and so forth; (4) diﬀerent accents and linguistic factors; and (5) transition probabilities (based on subject matter, voice dynamics, syntax, etc.). In addition, Cherry also speculated that humans have a vast memory of transition probabilities that make the task of hearing much easier by allowing prediction of word sequences. The main ﬁndings of the dichotic listening experiments conducted by Cherry and others have revealed that, in general, it is diﬃcult to attend to two sound sources at once; and when we switch attention to an unattended source (e.g., by
listening to a spoken name), we may lose information from the attended source. Indeed, our own common experience teaches us that when we attempt to tackle more than one task at a time, we may end up sacrificing performance. In subsequent joint investigations with colleagues (Cherry, 1961; Cherry and Sayers, 1956, 1959; Sayers and Cherry, 1957), Cherry also studied the binaural fusion mechanism and proposed a cross-correlation-based technique for measuring certain parameters of speech intelligibility. Basically, it was hypothesized that the brain performs correlation on the signals received by the two ears, playing the roles of localization and coincidence detection. In their binaural fusion studies, Sayers and Cherry (1957) showed that the human brain does indeed execute short-term correlation analysis in both monaural and binaural listening. To sum up, Cherry not only coined the term "cocktail party problem," but was also the first experimentalist to investigate the benefits of binaural hearing, to point to the potential of lip reading and related cues for improved hearing, and to emphasize the critical role of correlation in binaural fusion. Cherry was indeed a pioneer of human communication.

3.1.2 Independent Component Analysis

The development of independent component analysis (ICA) was partially motivated by a desire to solve a cocktail party problem. The essence of ICA can be stated as follows: given an instantaneous linear mixture of signals produced by a set of sources, devise an algorithm that exploits a statistical discriminant to differentiate these sources so as to provide for the separation of the source signals in a blind manner (Bell and Sejnowski, 1995; Comon, 1994; Jutten and Herault, 1991). The key question is: how? To address this question, we first recognize that if we are to achieve the blind separation of an instantaneous linear mixture of independent source signals, then there must be a characteristic departure from the simplest possible source model, the independently and identically distributed (i.i.d.) Gaussian model; any such departure gives rise to a more complex source model. The departure can arise in three different ways, depending on which of the three characteristic assumptions embodied in this simple source model is broken, as summarized here (Cardoso, 2001):

Non-Gaussian i.i.d. model: In this route to blind source separation, the i.i.d. assumption for the source signals is retained, but the Gaussian assumption is abandoned for all of the sources, except possibly one of them. The Infomax algorithm due to Bell and Sejnowski (1995), the natural gradient algorithm due to Amari et al. (1996), Cardoso's JADE algorithm (Cardoso, 1998; Cardoso and Souloumiac, 1993), and the FastICA algorithm due to Hyvärinen and Oja (1997) are all based on the non-Gaussian i.i.d. model. These algorithms differ from each other in the way in which the source information residing in higher-order statistics (HOS) is exploited.
Gaussian nonstationary model: In this second route to blind source separation, the Gaussian assumption is retained for all the sources, which means that second-order statistics (i.e., mean and variance) are sufficient to characterize each source signal. Blind source separation is achieved by exploiting the property of nonstationarity, provided that the source signals differ from each other in the ways in which their statistics vary with time. This approach to blind source separation was first described by Parra and Spence (2000) and Pham and Cardoso (2001). Whereas the algorithms based on the non-Gaussian i.i.d. model operate in the time domain, the algorithms belonging to the Gaussian nonstationary model operate in the frequency domain, a feature that also makes it possible for this second class of ICA algorithms to work with convolutive mixtures.

Gaussian, stationary, correlated-in-time model: In this third and final route to blind source separation, the blind separation of Gaussian stationary source signals is achieved on the proviso that their power spectra are not proportional to each other. Recognizing that the power spectrum of a wide-sense stationary random process is related to the autocorrelation function via the Wiener-Khintchine theorem, spectral differences among the source signals translate into corresponding differences in the correlated-in-time behavior of the source signals. It is this latter property that is available for exploitation.

To sum up, Comon's 1994 paper and the 1995 paper by Bell and Sejnowski have been the catalysts for the literature on ICA theory, algorithms, and novel applications. Indeed, the literature is so extensive and diverse that, in the course of ten years, ICA has established itself as an indispensable part of the ever-expanding discipline of statistical signal processing and has had a great impact on neuroscience (Brown et al., 2001).
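To make the first of these routes concrete, the following sketch (an illustration of ours, not code from the cited papers) separates a toy instantaneous mixture of two sub-Gaussian sources using a kurtosis-based fixed-point iteration in the spirit of FastICA; the particular sources, mixing matrix, and iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two non-Gaussian (sub-Gaussian) sources: a sinusoid and uniform noise.
t = np.arange(n) / 1000.0
s1 = np.sin(2.0 * np.pi * 5.0 * t)
s2 = rng.uniform(-1.0, 1.0, n)
S = np.vstack([s1, s2])

# Instantaneous linear mixture x = A s with an arbitrary mixing matrix.
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
X = A @ S

# Whitening: remove the mean and decorrelate the mixtures.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(Xc @ Xc.T / n)
Z = (E @ np.diag(d ** -0.5) @ E.T) @ Xc

# Fixed-point iteration with the cubic nonlinearity g(u) = u**3, which
# exploits kurtosis (a higher-order statistic) as the discriminant;
# symmetric decorrelation keeps the rows of W orthonormal.
W = np.eye(2)
for _ in range(100):
    U = W @ Z
    W_new = (U ** 3) @ Z.T / n - 3.0 * W
    u, _, vt = np.linalg.svd(W_new)
    W = u @ vt

Y = W @ Z   # recovered sources, up to permutation, sign, and scale
```

Because any ICA solution is blind to permutation, sign, and scale, recovery is judged by the absolute correlation between each row of Y and the true sources rather than by direct equality.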
On technical grounds, however, Haykin and Chen (2005) justify the statement that ICA does not solve the cocktail party problem; rather, it addresses a blind source separation (BSS) problem.

3.1.3 Temporal Binding and Oscillatory Correlation

Temporal binding theory was most elegantly illustrated by von der Malsburg (1981) in his seminal technical report entitled "Correlation Theory of Brain Function," in which he made two important observations: (1) the binding mechanism is accomplished by virtue of the correlation between presynaptic and postsynaptic activities, and (2) the strengths of the synapses follow Hebb's postulate of learning. When the synchrony between the presynaptic and postsynaptic activities is strong (weak), the synaptic strength correspondingly increases (decreases). Moreover, von der Malsburg suggested a dynamic link architecture to solve the temporal binding problem by letting neural signals fluctuate in time and by synchronizing those sets of neurons that are to be bound together into a higher-level symbol or concept. Using the same idea, von der Malsburg and Schneider (1986) proposed a solution to the cocktail party problem. In particular, they developed a neural cocktail-party processor that uses synchronization and desynchronization to segment the incoming sensory inputs. Though based only on simple experiments (in which von der Malsburg and Schneider used amplitude modulation and stimulus onset synchrony as the main acoustic cues, in line with Helmholtz's suggestion), the underlying idea is illuminating in that the model is consistent with anatomical and physiological observations.

The original idea of von der Malsburg was subsequently extended to different sensory domains, whereby the phases of neural oscillators are used to encode the binding of sensory components (Brown and Wang, 1997; Wang et al., 1990). Of particular interest is the two-layer oscillator model due to Wang and Brown (1999). The aim of this model is to achieve "searchlight attention" by examining the temporal cross-correlation between the activities of pairs (or populations) of neurons. The first layer, the segmentation layer, acts as a locally excitatory, globally inhibitory oscillator network; the second layer, the grouping layer, essentially performs computational auditory scene analysis (CASA). Preceding the oscillator network, there is an auditory periphery model (cochlea and hair cells) as well as a mid-level auditory representation stage (the correlogram). As reported by Wang and Brown (1999), the model is capable of segregating a mixture of voiced speech and various interfering sounds, thereby improving the signal-to-noise ratio (SNR) of the attended speech signal. The correlated neural oscillator model is arguably biologically plausible; unlike that of ICA algorithms, however, its performance appears to deteriorate significantly in the presence of multiple competing sources.

3.1.4 Cortronic Processing

The idea of the so-called cortronic network was motivated by the fact that the human brain employs an efficient sparse coding scheme to extract the features of sensory inputs and accesses them through associative memory (Hecht-Nielsen, 1998). Specifically, in Sagi et al. (2001), the CPP is viewed as an aspect of the problem of human speech recognition in a cocktail party environment, and the solution is regarded as an attended-source identification problem. In the experiments reported therein, only one microphone was used to record the auditory scene; the listener, however, was assumed to be familiar with the language of the conversation under study. Moreover, all the subjects were chosen to speak the same language and to have similar voice qualities. The goal of the cortronic network is to identify one attended speech signal of interest, which is the essence of the CPP. According to the experimental results reported in Sagi et al. (2001), the cortronic network appears to be quite robust with respect to variations in speech, speaker, and noise, even at an SNR of −8 dB (with a single microphone). Compared to other computational approaches proposed to solve the CPP, the cortronic network distinguishes itself by exploiting prior knowledge pertaining to speech and the spoken language; it also implicitly confirms the validity of Cherry's early speculation on the use of memory (recalling section 3.1.1).

3.2 Human Auditory Scene Analysis: An Overview

Human auditory scene analysis (ASA) is a general process carried out by the auditory system of a human listener for the purpose of extracting information pertaining to a sound source of interest that is embedded in a background of noise or interference. The auditory system is made up of the two ears (constituting the organs of hearing) and the auditory pathways. In more specific terms, it is a sophisticated information-processing system that enables us not only to detect the frequency composition of an incoming sound wave but also to locate the sound sources (Kandel et al., 2000). This is all the more remarkable given that the energy in the incoming sound waves is exceedingly small and the frequency composition of most sounds is rather complicated.

3.2.1 Where and What

The mechanisms of auditory perception essentially involve two processes: sound localization ("where") and sound recognition ("what"). It is well known that, for localizing sound sources in the azimuthal plane, the interaural time difference (ITD) is the main acoustic cue at low frequencies and for complex stimuli with low-frequency repetition, while the interaural level difference is the main cue at high frequencies (Blauert, 1983; Yost, 2000; Yost and Gourevitch, 1987). Spectral differences introduced by the head-related transfer function (HRTF) are the main cues used for vertical localization. Loudness (intensity) and early reflections are possible cues for localization as a function of distance. In hearing, the precedence effect refers to the phenomenon, occurring during auditory fusion, whereby two sounds of the same order of magnitude presented dichotically are localized toward the ear receiving the first sound stimulus (Yost, 2000); the precedence effect thus stresses the importance of the first wavefront in determining the perceived sound location.

The "what" question mainly addresses the processes of sound segregation (streaming) and sound determination (identification). While it has a critical role in sound localization, spatial separation is not considered to be a strong acoustic cue for streaming or segregation (Bregman, 1990). According to Bregman's studies, sound segregation is a two-stage process of feature selection and feature grouping. Feature selection involves processing the auditory stimuli into a collection of favorable (e.g., frequency-sensitive, pitch-related, temporal-spectral) features. Feature grouping, on the other hand, is responsible for combining similar elements of the incoming sounds, according to certain principles, into one or more coherent streams, with each stream corresponding to one informative sound source. Sound determination is more specific than segregation in that it involves not only segmentation of the incoming sound into different streams but also identification of the content of the sound source in question.

3.2.2 Spatial Hearing

From a communication perspective, our two outer ears act as receiving antennae for acoustic signals from a speaker or other audio source. In the presence of one or more competing or masking sound sources, the human ability to detect and understand the source of interest (i.e., the target) is degraded. However, the influence of the masking sources generally decreases when the target and maskers are spatially separated, compared to when the target and maskers are in the same location; this effect is credited to spatial hearing (filtering). As pointed out in section 3.1.1, Cherry (1953) suggested that spatial hearing plays a major role in the auditory system's ability to separate sound sources in a multiple-source acoustic environment. Many subsequent experiments have verified Cherry's conjecture. Specifically, directional hearing (Yost and Gourevitch, 1987) is crucial for suppressing interference and enhancing speech intelligibility (Bronkhorst, 2000; Hawley et al., 1999). Spatial separation of the sound sources is also believed to be more beneficial to localization than to segregation (Bregman, 1990). The classic book by Blauert (1983) presents a comprehensive treatment of the psychophysical aspects of human sound localization.

Given multiple sound sources in an enclosed space (such as a conference room), spatial hearing helps the brain take full advantage of slight differences (e.g., in timing and intensity) between the signals that reach the two outer ears. These differences support monaural (autocorrelation) and binaural (cross-correlation) processing for specific tasks (such as coincidence detection, precedence detection, localization, and fusion), on the basis of which auditory events are identified and then passed on to higher-level auditory processing (i.e., attention, streaming, and cognition). Fig. 3.1 provides a functional diagram of the binaural spatial hearing process.

3.2.3 Binaural Processing

One of the key observations derived from Cherry's classic experiments described in section 3.1.1 is that it is easier to separate sources heard binaurally than sources heard monaurally. Quoting from Cherry and Taylor (1954): "One of the most striking facts about our ears is that we have two of them—and yet we hear one acoustic world; only one voice per speaker." We believe that nature gives us two ears for a reason, just as it gives us two eyes. It is binocular vision (stereovision) and binaural hearing (stereausis) that enable us to perceive the dynamic world and that provide our main sources of sensory information. Binocular/binaural processing is considered to be crucial in certain perceptual activities (e.g., binocular/binaural fusion, depth perception, localization). Given one sound source, the two ears receive slightly different sound patterns because of the finite delay produced by their physically separated locations. The brain is known to be extremely efficient in extracting and then using different acoustic cues (to be discussed in detail later) to perform specific audition tasks.

Figure 3.1  A functional diagram of binaural hearing, which consists of physical, psychophysical, and psychological aspects of auditory perception. Bottom-up (signal-driven) processes (autocorrelation, cross-correlation, coincidence detection, localization, and fusion) feed the formation of the auditory event, aided by visual cues, and are followed by top-down (expectation-driven) processes (segregation, attention, consciousness, and cognition). Adapted from Blauert (1983) with permission.

An influential binaural phenomenon is so-called binaural masking (e.g., Moore, 1997; Yost, 2000). The threshold for detecting a signal masked in noise can sometimes be lower when listening with two ears than when listening with only one, as demonstrated by the phenomenon called the binaural masking level difference (BMLD). It is known (Yost, 2000) that the masked threshold of a signal is the same whether the stimuli are presented in a monotic or a diotic condition; when the masker and the signal are presented in a dichotic situation, however, the signal has a lower threshold than in either the monotic or the diotic condition. Similarly, many experiments have verified that binaural hearing increases speech intelligibility when the speech signal and the noise are presented dichotically. Another important binaural phenomenon is binaural fusion, which is the essence of directional hearing. As pointed out in section 3.1.1, the fusion mechanism is naturally modeled as performing some kind of correlation (Cherry, 1961; Cherry and Sayers, 1956), for which a binaural fusion model based on the autocorrelogram and cross-correlogram has been proposed, as illustrated in fig. 3.1.
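The correlation view of binaural fusion can be illustrated with a toy interaural-time-difference estimator (a sketch of ours, not a model from the literature cited above; the delay, attenuation, and lag range are arbitrary assumptions): the peak of the interaural cross-correlation function plays the role of a coincidence detector.

```python
import numpy as np

fs = 44100                     # sampling rate (Hz)
true_delay = 20                # interaural delay in samples (~0.45 ms)

rng = np.random.default_rng(1)
source = rng.standard_normal(fs // 10)   # 100 ms of broadband sound

# Simulated ear signals: the far ear hears a delayed, attenuated copy.
left = source
right = np.zeros_like(source)
right[true_delay:] = 0.8 * source[:-true_delay]

# Cross-correlate over a plausible range of interaural lags and pick the
# peak, mimicking a bank of coincidence detectors; the edges are trimmed
# so that np.roll's wraparound never enters the correlated region.
max_lag = 40
core = slice(max_lag, len(source) - max_lag)
lags = np.arange(-max_lag, max_lag + 1)
xcorr = np.array([np.dot(left[core], np.roll(right, -lag)[core])
                  for lag in lags])

itd_samples = int(lags[np.argmax(xcorr)])
itd_ms = 1000.0 * itd_samples / fs
```

For the simulated 20-sample delay at 44.1 kHz, the peak lag corresponds to an ITD of roughly 0.45 ms, which lies within the natural human range.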

3.3 Computational Auditory Scene Analysis

In contrast to human auditory scene analysis (ASA), computational auditory scene analysis (CASA) relies on the development of a computational model of the auditory scene, with one of two goals in mind, depending on the application of interest:

The design of an intelligent machine that, by itself, is able to automatically extract and track a sound signal of interest in a cocktail party environment.

The design of an adaptive hearing system that computes the perceptual grouping process missing from the auditory system of a hearing-impaired individual, thereby enabling that individual to attend to a sound signal of interest in a cocktail party environment.

Naturally, CASA is motivated by, and builds on, our understanding of human auditory scene analysis or, even more generally, of human cognitive behavior. Following the seminal work of Bregman (1990), many researchers (see, e.g., Brown and Cooke, 1994; Cooke, 1993; Cooke and Ellis, 2001; Rosenthal and Okuno, 1998) have tried to exploit CASA in different ways. Representative approaches include the data-driven scheme (Cooke, 1993) and the prediction-driven scheme (Ellis, 1996). The feature common to the two schemes is the integration of low-level (bottom-up, primitive) acoustic cues for potential grouping. The main difference between them is as follows: data-driven CASA aims to decompose the auditory scene into time-frequency elements ("strands") and then run the grouping procedure, whereas prediction-driven CASA views prediction as the primary goal and requires only a world model that is consistent with the stimulus; the latter integrates top-down and bottom-up cues and can deal with incomplete or masked data (i.e., a speech signal with missing information).
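A toy version of the data-driven idea can be sketched in a few lines (an illustration of ours, not Cooke's system; the two-tone scene and the oracle "ideal binary mask" are simplifying assumptions): decompose the mixture into time-frequency cells, keep the cells in which the attended source dominates, and resynthesize by overlap-add.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
target = np.sin(2.0 * np.pi * 440.0 * t)     # attended source
masker = np.sin(2.0 * np.pi * 1200.0 * t)    # interfering source
mix = target + masker

# Short-time Fourier analysis: windowed frames are a rough analogue of
# the time-frequency "strands" of data-driven CASA.
frame, hop = 256, 128
win = np.hanning(frame)
starts = range(0, len(mix) - frame, hop)

def stft(x):
    return np.array([np.fft.rfft(x[i:i + frame] * win) for i in starts])

spec = stft(mix)

# Ideal binary mask: keep a cell only where the target dominates the
# masker. A real system must estimate this mask from grouping cues;
# here we use oracle knowledge of the two sources.
mask = (np.abs(stft(target)) > np.abs(stft(masker))).astype(float)

# Overlap-add resynthesis of the masked spectrogram.
out = np.zeros(len(mix))
for k, row in enumerate(spec * mask):
    i = k * hop
    out[i:i + frame] += np.fft.irfft(row, frame)
```

In a real CASA system the mask must of course be estimated from grouping cues (pitch, onsets, location); the oracle mask here only shows that selecting time-frequency strands is enough to recover the attended tone almost perfectly.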
However, as emphasized by Bregman (1996), it is important for CASA modelers to take into account psychological data as well as the way humans carry out ASA; namely, modeling the stability of human ASA, making it possible for different cues to cooperate and compete, and accounting for the propagation of constraints across the frequency-by-time field.

3.3.1 Acoustic Cues

The psychophysical attributes of sound mainly involve three forms of information: spatial location, temporal structure, and spectral characterization. The perception of a sound signal in a cocktail party environment is uniquely determined by this collective information; a difference in any of the three forms of information is believed to be sufficient to discriminate between two different sound sources. In sound perception, many acoustic features (cues) are used to perform specific tasks. Table 3.1 summarizes some visual/acoustic features (i.e., the spatial, temporal, or spectral patterns) used for single-stream sound perception.

Table 3.1  The Features and Cues Used in Sound Perception

Feature/Cue             Domain              Task
Visual                  Spatial             "Where"
ITD                     Spatial             "Where"
IID                     Spatial             "Where"
Intensity, loudness     Temporal            "Where" + "What"
Periodicity             Temporal            "What"
Onsets                  Temporal            "What"
AM                      Temporal            "What"
FM                      Temporal-spectral   "What"
Pitch                   Spectral            "What"
Timbre, tone            Spectral            "What"
Harmonicity, formant    Spectral            "What"

A brief description of some of the important acoustic cues listed in table 3.1 is in order:

Interaural time difference (ITD): A measure of the difference between the time at which a sound wave reaches the left ear and the time at which the same wave reaches the right ear.

Interaural intensity difference (IID): A measure of the difference in the intensity of the sound waves reaching the two ears due to head shadow.

Amplitude modulation (AM): A method of sound signal transmission whereby the amplitude of some carrier frequency is modified in accordance with the sound signal.

Frequency modulation (FM): Another method of modulation, in which the instantaneous frequency of the carrier is varied with the frequency of the sound signal.

Onset: A sudden increase in the energy of a sound signal; as such, each discrete event in the sound signal has an onset.

Pitch: A property of auditory sensation in which sounds are ordered on a musical scale; in a way, pitch bears a relationship to frequency similar to that of loudness to intensity.

Timbre: The attribute of auditory sensation by which a listener is able to discriminate between two sound signals of similar loudness and pitch but of different tonal quality; timbre depends primarily on the spectrum of a sound signal.

A combination of some or all of these acoustic cues is the key to performing CASA. Psychophysical evidence also suggests that useful cues may be provided by spectral-temporal correlations (Feng and Ratnam, 2000).

3.3.2 Feature Binding

One other important function involved in CASA is that of feature binding, which
refers to the problem of representing conjunctions of features. According to von der Malsburg (1999), binding is a general process that applies to all types of knowledge representation, extending from the most basic perceptual representations to the most complex cognitive representations. Feature binding may be either static or dynamic. Static feature binding involves a representational unit that stands for a specific conjunction of properties, whereas dynamic feature binding represents conjunctions of properties as the binding of units in the representation of an auditory scene. The most popular dynamic binding mechanism is based on temporal synchrony, hence the reference to it as "temporal binding"; this form of binding was discussed in section 3.1.3. König et al. (1996) have suggested that the synchronous firing of neurons plays an important role in information processing within the cortex. Rather than being temporal integrators, cortical neurons might serve the purpose of coincidence detectors, evidence for which has been presented by many researchers (König and Engel, 1995; Schultz et al., 2000; Singer, 1993). Dynamic binding is closely related to the attention mechanism, which is used to control the synchronized activities of different assemblies of units and the way in which the finite binding resource is allocated among neuronal assemblies (Singer, 1993, 1995). Experimental evidence has shown that synchronized firing tends to provide the attended stimulus with an enhanced representation.

3.3.3 Dereverberation

For auditory scene analysis, it is important to study the effect of room acoustics on the cocktail party environment (Blauert, 1983; MacLean, 1959). A conversation occurring in a closed room often suffers from the multipath effect: echoes and reverberation, which are almost ubiquitous but rarely consciously noticed. Depending on the acoustics of the room, reflections from the surfaces (e.g., walls, floor) produce reverberation. In the time domain, a reflection manifests itself as smaller, delayed replicas (echoes) added to the original sound; in the frequency domain, it introduces a comb-filter effect into the frequency response. When the room is large, echoes can sometimes be heard consciously.

The human auditory system is powerful enough to take advantage of binaural and spatial hearing to suppress echoes efficiently, thereby improving hearing performance. For the machine CPP, however, the machine design would have to include specific dereverberation (or deconvolution) algorithms to overcome this effect. Those acoustic cues listed in table 3.1 that are spatially dependent, such as ITD and IID, are naturally affected by reverberation. On the other hand, acoustic cues that are space invariant, such as common onset across frequencies and pitch, are less sensitive to reverberation. On this basis, we may say that an intelligent machine should have the ability to adaptively weight the spatially dependent acoustic cues (prior to their fusion) so as to deal with a reverberant environment in an effective manner.
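The comb-filter effect and its removal can be seen in a minimal single-echo model (a sketch of ours; a real room has many reflections with unknown parameters, so the known gain a and delay D assumed here are a strong idealization):

```python
import numpy as np

a, D = 0.6, 50                     # echo gain and delay (samples), assumed known
rng = np.random.default_rng(2)
dry = rng.standard_normal(2000)    # anechoic ("dry") source signal

# Single-reflection room model: wet[n] = dry[n] + a * dry[n - D],
# i.e., convolution with the impulse response h = [1, 0, ..., 0, a].
wet = dry.copy()
wet[D:] += a * dry[:-D]

# In the frequency domain h is a comb filter: |H| swings between
# 1 - a and 1 + a, carving periodic notches into the spectrum.
h = np.r_[1.0, np.zeros(D - 1), a]
H = np.fft.rfft(h, 1024)

# Exact inverse (deconvolution) for this toy model: subtract the echo
# recursively, est[n] = wet[n] - a * est[n - D].
est = np.zeros_like(wet)
for n in range(len(wet)):
    est[n] = wet[n] - a * est[n - D] if n >= D else wet[n]
```

With the echo parameters known, the recursive inverse filter recovers the dry signal exactly; practical dereverberation must instead estimate the room response blindly, which is why the text calls for adaptive weighting of the spatially dependent cues.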

3.4 Insights from Computational Vision

It is well known to neuroscientists that audition (hearing) and vision (seeing) share substantial common features in their sensory processing principles as well as in their anatomical and functional organization in the higher-level centers of the cortex. With the design of an effective and efficient machine CPP as our goal, it is therefore highly informative to derive insights from the extensive literature on computational vision. We do so in this section, beginning with Marr's classic theory of vision.

3.4.1 Marr's Vision Theory and Its Insights for Auditory Scene Analysis

In his landmark book, David Marr presented three levels of analysis of information-processing systems (Marr, 1982):

Computation: What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?

Representation: How can this computational theory be implemented? In particular, what is the representation for the input and output, and what is the algorithm for the transformation?

Implementation: How can the representation and the algorithm be realized physically?


In many respects, Marr's observations highlight the fundamental questions that need to be addressed in computational neuroscience, not only in the context of vision but also in that of audition. As a matter of fact, Marr's theory has provided many insights into auditory research (Bregman, 1990; Rosenthal and Okuno, 1998). In a similar vein to visual scene analysis (e.g., Julesz and Hirsh, 1972), auditory scene analysis (Bregman, 1990) attempts to identify the content (what) and the location (where) of the sounds or speech in an auditory environment. In specific terms, auditory scene analysis consists of two stages. In the first stage, the segmentation process decomposes a complex acoustic scene into a collection of distinct sensory elements; in the second stage, the grouping process combines these elements into streams according to certain principles. Subsequently, the streams are interpreted by a higher-level process for recognition and scene understanding. Motivated by Gestalt psychology, Bregman (1990) proposed five grouping principles for ASA:

Proximity: Characterizes the distances between auditory cues (features) with respect to their onsets, pitch, and intensity (loudness).

Similarity: Usually depends on the properties of a sound signal, such as timbre.

Continuity: Features the smoothly varying spectrum of a sound signal.

Closure: Completes fragmentary features that have a good gestalt; the completion may be viewed as a form of auditory compensation for masking.


Common fate: Groups together activities (e.g., common onsets) that are synchronous.

Moreover, Bregman (1990) has distinguished at least two levels of auditory organization: primitive streaming and schema-based segregation, with schemas being provided by phonetic, prosodic, syntactic, and semantic forms of information. While applicable to general sound-scene analysis involving speech and music, Bregman's work has focused mainly on primitive stream segregation.

3.4.2 A Tale of Two Sides: Visual and Auditory Perception

Visual perception and auditory perception share many common features in terms of sensory-processing principles. According to Shamma (2001), these common features include the following:

Lateral inhibition for edge/peak enhancement: In an auditory task, it serves to extract the profile of the sound spectrum, whereas in the visual system it serves to extract the form of an image.

Multiscale analysis: The auditory system performs cortical spectrotemporal analysis, whereas the visual system performs cortical spatiotemporal form analysis.

Detecting temporal coincidence: This process may serve periodicity pitch perception in an auditory scene, comparable to the perception of bilateral symmetry in a visual task.

Detecting spatial coincidence: The same algorithm captures binaural azimuthal localization in the auditory system (stereausis), while it gives rise to binocular depth perception in the visual system.

Beyond sharing these common features and processes, the auditory system also benefits from the visual system. For example, it is well known that interactions exist between different sensory modalities. Neuroanatomy reveals the existence of corticocortical pathways between the auditory and visual cortices. The hierarchical organization of the cortices and the numerous thalamocortical and corticothalamic feedback loops are speculated to stabilize the perceptual object. Daily-life experience also teaches us that visual input (e.g., lip reading) influences attention (Jones and Yee, 1996) and benefits speech perception. The McGurk effect (McGurk and MacDonald, 1976) is an audiovisual speech illusion in which the perception of a speech sound is modified by contradictory visual information. The McGurk effect clearly illustrates the important role played by visual cues in the comprehension of speech.
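The lateral-inhibition principle listed above can be illustrated with a minimal one-dimensional sketch (the kernel weights and the test "spectrum" below are hypothetical, chosen only to demonstrate the edge/peak enhancement effect):

```python
import numpy as np

def lateral_inhibition(profile, excite=1.0, inhibit=0.5):
    """Enhance peaks/edges: each unit is excited by its own input and
    inhibited by its two neighbors (a 1-D center-surround kernel)."""
    kernel = np.array([-inhibit, excite + 2 * inhibit, -inhibit])
    # mode="same" keeps the output the same length as the input
    return np.convolve(profile, kernel, mode="same")

# A flat spectral profile with a single step edge
spectrum = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
enhanced = lateral_inhibition(spectrum)
# Interior flat regions are preserved, while the response undershoots
# and overshoots on either side of the step, sharpening the edge.
```

The same center-surround idea, applied over a two-dimensional image instead of a spectral profile, yields the edge enhancement familiar from early vision.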

3.4.3 Active Vision

In the last paragraph of the introduction section, we referred to active audition as having the potential to build an intelligent machine that can operate efficiently and effectively in a noisy cocktail party environment. The proposal to build such a machine has been inspired by two factors: ongoing research on the use of active vision in computational vision, and the sharing of many common sensory principles between visual perception and auditory perception, as discussed earlier. To pave the way for the framework for active audition that we have in mind, we first present some highlights of active vision that are of value to the formulation of this framework.

First and foremost, it is important to note that the use of an active sensor is not a necessary requirement for active sensing, be it in the context of vision or audition. Rather, a passive sensor (which only receives but does not transmit information-bearing signals) can perform active sensing, provided that the sensor is capable of changing its own state parameters in accordance with a desired sensing strategy. As such, active sensing may be viewed as an application of intelligent control theory, which includes not only control but also reasoning and decision making (Bajcsy, 1988).4 In particular, active sensing embodies the use of feedback in two contexts:

1. The feedback is performed on complex processed sensory data, such as extracted features that may also include relational features.

2. The feedback is dependent on prior knowledge.

Active vision (also referred to as animated vision) is a special form of active sensing, which has been proposed by Bajcsy (1988) and Ballard (1988), among others (e.g., Blake and Yuille, 1992). In active vision, it is argued that vision is best understood in the context of visual behaviors. The key point to note here is that the task of vision is not to build a model of the surrounding real world, as originally postulated in Marr's theory, but rather to use visual information in the service of the real world in real time, and to do so efficiently and inexpensively (Clark and Eliasmith, 2003). In effect, the active vision paradigm gives "action" a starring role (Sporns, 2003).

Rao and Ballard (1995) proposed an active vision architecture motivated by biological studies. The architecture is based on the hierarchical decomposition of the visual behavior involved in scene analysis (i.e., relating internal models to external objects). The architecture employs two components: the "what" component, which corresponds to the problem of object identification, and the "where" component, which corresponds to the problem of object localization. These two visual components, or routines, are subserved by two separate memories. The central representation of the architecture is a high-dimensional iconic feature vector composed of the responses of different-order derivatives of Gaussian filters; the purpose of the iconic feature vector is to provide an effective photometric description of local intensity variations in the image region about an object of interest.
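The iconic feature vector idea can be sketched as follows: collect Gaussian-derivative filter responses at a single image location across several scales. The filter orders and scales here are illustrative choices, not those of Rao and Ballard's actual implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def iconic_features(image, row, col, sigmas=(1.0, 2.0, 4.0)):
    """Collect responses of Gaussian-derivative filters (orders 0-2
    along each axis) at one pixel, across several scales."""
    features = []
    for sigma in sigmas:
        for dy in range(3):        # derivative order along rows
            for dx in range(3):    # derivative order along columns
                response = gaussian_filter(image, sigma, order=(dy, dx))
                features.append(response[row, col])
    return np.array(features)

# A synthetic image containing a bright square "object"
img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0
vec = iconic_features(img, 16, 16)   # 3 scales x 9 filters = 27 dimensions
```

Such a vector gives a compact photometric signature of the local intensity pattern, which can then be matched against stored object memories.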

3.5 Embodied Intelligent Machines: A Framework for Active Audition


Most of the computational auditory scene analysis (CASA) approaches discussed in the literature share a common assumption: the machine merely listens to the environment but does not interact with it (i.e., the observer is passive). However, as remarked in the previous section, there are many analogies between the mechanisms at work in auditory perception and their counterparts in visual perception. In a similar vein to active vision, active audition is established on the premise that the observer (human or machine) interacts with the environment; the machine (in a manner similar to a human) should therefore conduct perception in an active fashion.

According to Varela et al. (1991) and Sporns (2003), embodied cognitive models rely on cognitive processes that emerge from interactions between neural, bodily, and environmental factors. A distinctive feature of these models is that they use "the world as their own model." In particular, embodied cognition has been argued to be the key to the understanding of intelligence (Iida et al., 2004; Pfeifer and Scheier, 1999). The central idea of embodied cognitive machines lies in the observation that "intelligence" becomes meaningless if we exclude ourselves from a real-life scenario; in other words, an intelligent machine is a self-reliant and independent agent capable of adapting itself to a dynamic environment so as to achieve a certain satisfactory goal effectively and efficiently, regardless of the initial setup. Bearing this goal in mind, we may now propose a framework for active audition, which embodies four specific functions: (1) localization and focal attention, (2) segregation, (3) tracking, and (4) learning. In the following, we will address these four functions in turn.

3.5.1 Localization and Focal Attention

Sound localization is a fundamental attribute of auditory perception. The task of sound localization can be viewed as a form of binaural depth perception, representing the counterpart to binocular depth perception in vision. A classic model for sound localization was developed by Jeffress (1948) using binaural cues such as the ITD. In particular, Jeffress suggested the use of cross-correlation for calculating the ITD in the auditory system and explained how the model represents the ITD received at the two ears; the sound processing and representation in Jeffress's model are simple yet elegant, and arguably neurobiologically plausible.

Since the essential goal of localization is to infer the directions of incoming sound signals, this function may be implemented by using an adaptive array of microphones, whose design is based on direction-of-arrival (DOA) estimation algorithms developed in the signal-processing literature (e.g., Van Veen and Buckley, 1988, 1997). Sound localization is often the first step in beamforming, the aim of which is to extract the signal of interest produced in a specific direction. For a robot (or machine) that is self-operating in an open environment, sound localization is essential for subsequent tasks. An essential ingredient in sound localization is time-delay estimation when it is performed in a reverberant room environment. To perform this estimation, many signal-processing techniques have been proposed in the literature:

Generalized cross-correlation (GCC) method (Knapp and Carter, 1976): This is a simple yet efficient delay-estimation method, which is implemented in the time domain using maximum-likelihood estimation (however, a frequency-domain implementation is also possible).

Cross-power spectrum phase (CSP) method (Rabinkin et al., 1996): This delay-estimation method is implemented in the frequency domain; it computes the power spectra of two microphone signals and returns the phase difference between the spectra.

Adaptive eigenvalue decomposition (EVD)-based methods (Benesty, 2000; Doclo and Moonen, 2002): The GCC and CSP methods usually assume an ideal room model without reverberation; hence they may not perform satisfactorily in a highly reverberant environment. In order to overcome this drawback and enhance robustness, EVD-based methods have been proposed to estimate (implicitly) the acoustic impulse responses, using adaptive algorithms that iteratively estimate the eigenvector associated with the smallest eigenvalue. Given the estimated acoustic impulse responses, the time delay can be calculated as the time difference between the main peaks of the two impulse responses or as the peak of the correlation function between the two impulse responses.

Upon locating the sound source of interest, the next step is to focus on the target sound stream and enhance it. Spatial filtering, or beamforming, techniques (Van Veen and Buckley, 1997) are beneficial for this purpose. Usually, with omnidirectional microphone (array) technology, a machine is capable of picking up most if not all of the sound sources in the auditory scene.
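The GCC family of methods builds on the basic cross-correlation delay estimator; a minimal sketch of the plain, unweighted special case (GCC variants additionally apply a spectral weighting, e.g. maximum-likelihood or phase transform, before locating the peak) with synthetic signals is:

```python
import numpy as np

def estimate_delay(x1, x2):
    """Estimate how many samples x2 is delayed with respect to x1,
    via the peak of the full cross-correlation sequence."""
    cc = np.correlate(x2, x1, mode="full")
    return int(np.argmax(cc)) - (len(x1) - 1)

# Synthetic test: the same noise burst arrives 7 samples later at mic 2
rng = np.random.default_rng(0)
s = rng.standard_normal(256)
mic1 = s
mic2 = np.concatenate((np.zeros(7), s))[:256]   # delayed copy
d = estimate_delay(mic1, mic2)   # -> 7
```

With a known microphone spacing and sampling rate, such a sample delay converts directly into an ITD and hence an azimuth estimate.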
Considering real-life conversation scenarios, however, it is hoped that "smart microphones" may be devised that adapt their directivity (i.e., are autodirective) to the attended speaker. Hence, designing a robust beamformer for a noisy and reverberant environment is crucial for localizing the sound and enhancing the SNR. This adaptivity also naturally brings in the issue of learning, to be discussed in what follows.

3.5.2 Segregation

In this second functional module for active audition, the target sound stream is segregated and the sources of interference are suppressed, thereby focusing attention on the target sound source. This second function may be implemented by using several acoustic cues (e.g., ITD, IID, onset, and pitch) and then combining them in a fusion algorithm. In order to emulate the human auditory system, a computational strategy for acoustic-cue fusion should dynamically resolve the ambiguities caused by single-cue segregation. The simplest solution is the "winner-take-all" competition, which essentially chooses the cue that has the highest (quantitative) confidence (where the confidence values depend on the specific model used to extract the acoustic cue). When several acoustic cues are in conflict, only the dominant cue is chosen according to some criterion, such as the weighted-sum mechanism (Woods et al., 1996) that was used for integrating pitch and spatial cues, or the Bayesian framework (Kashino et al., 1998).

Figure 3.2 A flowchart of the multiple acoustic cue fusion process (courtesy of Rong Dong).

Recently, Dong (2005) proposed a simple yet effective fusion strategy to solve the multiple-cue fusion problem (see fig. 3.2). Basically, the fusion process is performed in a cooperative manner. In the first stage of fusion, given the IID and ITD cues, the time-frequency units are grouped into two streams (a target stream and an interference stream), and the grouping results are represented by two binary maps. These two binary maps are then passed through an "AND" operation to obtain a spatial segregation map, which is further utilized to estimate the pitch of the target signal or the pitch of the interference. Likewise, a binary map is produced from the pitch segregation. If the target is detected as an unvoiced signal, the onset cue is integrated to group the components into separate streams. Finally, all these binary maps are pooled together by a second "AND" operation to yield the final segregation decision. Empirical experiments on this fusion algorithm reported by Dong (2005) have shown very promising results.5
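The AND-style fusion of binary time-frequency masks can be sketched as follows (the maps below are toy examples; the actual cue-extraction models in Dong (2005) are far more elaborate):

```python
import numpy as np

# Toy binary maps over a (frequency x time) grid: 1 = the unit is
# assigned to the target stream by that cue, 0 = to the interference.
iid_map = np.array([[1, 1, 0], [0, 1, 1]])
itd_map = np.array([[1, 0, 0], [0, 1, 1]])

# First stage: spatial segregation map from the IID and ITD cues
spatial_map = np.logical_and(iid_map, itd_map)

# Second stage: combine with a (toy) pitch-based segregation map
pitch_map = np.array([[1, 1, 1], [0, 0, 1]])
final_map = np.logical_and(spatial_map, pitch_map).astype(int)
# A unit survives only if every cue assigns it to the target stream
```

The conjunctive ("AND") design is conservative: a time-frequency unit is retained for the target only when all available cues agree, which suppresses interference at the cost of possibly discarding some target energy.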

3.5.3 Tracking

Figure 3.3 An information flowchart integrating "bottom-up" (shaded arrow) and "top-down" flows in a hierarchical functional module of an intelligent machine (in a similar fashion as in the human auditory cortex).

The theoretical development of sound tracking builds on a state-space model of the auditory environment. The model consists of a process equation that describes the evolution of the state (denoted by x_t) at time t, and a measurement equation that describes the dependence of the observables (denoted by y_t) on the state. More specifically, the state is a vector of acoustic cues (features) characterizing the target sound stream and its direction. Stated mathematically, we have the following state-space equations:

x_t = f(t, x_{t-1}, d_t),   (3.1)

y_t = g(t, x_t, u_t, v_t),   (3.2)

where d_t and v_t denote the dynamic and measurement noise processes, respectively, and the vector u_t denotes the action taken by the (passive) observer. The process equation 3.1 embodies the state transition probability p(x_t | x_{t-1}), whereas the measurement equation 3.2 embodies the likelihood p(y_t | x_t). The goal of optimum filtering is then to estimate the posterior probability density p(x_t | y_{0:t}), given the initial prior p(x_0) and the measurement history y_{0:t} from time 0 to t. This classic problem is often referred to as "state estimation" in the literature. Depending on the specific scenario under study, such a hidden-state estimation problem can be tackled by using a Kalman filter (Kalman, 1960), an extended Kalman filter, or a particle filter (e.g., Cappé et al., 2005; Doucet et al., 2001).

In Nix et al. (2003), a particle filter is used as a statistical method for integrating temporal and frequency-specific features of a target speech signal. The elements of the state represent the azimuth and elevation of the different sound signals as well as the band-grouped short-time spectrum of each signal, whereas the observable measurements contain the binaural short-time spectra of the superposed voice signals. The state equation, representing the spectral dynamics of the speech signal, was learned off-line using vector quantization and a lookup table in a large codebook, where the codebook index for each pair of successive spectra was stored in a Markov transition matrix (MTM); the MTM provides statistical information about the transition probability p(x_t | x_{t-1}) between successive short-time speech spectra. The measurement equation, characterized by p(y_t | x_t), was approximated as a multidimensional Gaussian mixture probability distribution. By virtue of its very design, it is reported in Nix et al. (2003) that the tracker provides a one-step prediction of the underlying features of the target sound.

In the much more sophisticated neurobiological context, we may envision that the hierarchical auditory cortex (acting as a predictor) implements an online tracking task as a basis for dynamic feature binding and Bayesian estimation, in a fashion similar to that in the hierarchical visual cortex (Lee and Mumford, 2003; Rao and Ballard, 1999). Naturally, we may also incorporate "top-down" expectation as a feedback loop within the hierarchy to build a more powerful inference/prediction model. This is motivated by the generally accepted fact that a hierarchical architecture is omnipresent in the sensory cortices, starting with the primary sensory cortex and proceeding up to the highest areas that encode the most complex, abstract, and stable information.6 A schematic diagram illustrating such a hierarchy is depicted in fig. 3.3, where the bottom-up (data-driven) and top-down (knowledge-driven) information flows are illustrated with arrows. In the figure, the feedforward pathway carries the inference, given the current and past observations; the feedback pathway conducts the prediction (expectation) to lower-level regions.

To be specific, let z denote the top-down signal; then the conditional joint probability of the hidden state x and the bottom-up observation y, given z, may be written as

p(x, y | z) = p(y | x, z) p(x | z),   (3.3)

and the posterior probability of the hidden state can be expressed via Bayes's rule:

p(x | y, z) = p(y | x, z) p(x | z) / p(y | z),   (3.4)

where the denominator is a normalizing constant that is independent of the state x, the term p(x | z) in the numerator characterizes a top-down contextual prior, and the other term p(y | x, z) describes the likelihood of the observation, given all available information. Hence, feedback information from a higher level can provide useful context to interpret or disambiguate lower-level patterns. The same inference principle can be applied to the different levels of the hierarchy in fig. 3.3. To sum up, top-down predictive coding and bottom-up inference cooperate in learning the statistical regularities of the sensory environment; the top-down and bottom-up mechanisms also provide a possible basis for optimal action control within the framework of active audition, in a way similar to active vision (Bajcsy, 1988).
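As a minimal illustration of this state-estimation machinery, a bootstrap particle filter for a scalar random-walk state observed in Gaussian noise can be sketched as follows (the model and its parameters are illustrative, far simpler than the speech-spectrum model of Nix et al.):

```python
import numpy as np

rng = np.random.default_rng(1)

def particle_filter(ys, n_particles=500, q=0.1, r=0.5):
    """Bootstrap particle filter for x_t = x_{t-1} + N(0, q^2),
    y_t = x_t + N(0, r^2); returns the posterior-mean estimates."""
    particles = rng.standard_normal(n_particles)   # samples from p(x_0)
    estimates = []
    for y in ys:
        # predict: propagate each particle through the process model
        particles = particles + q * rng.standard_normal(n_particles)
        # update: weight each particle by the measurement likelihood
        weights = np.exp(-0.5 * ((y - particles) / r) ** 2)
        weights /= weights.sum()
        # resample according to the weights (multinomial resampling)
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
        estimates.append(particles.mean())
    return np.array(estimates)

# Track a slowly drifting "state" from noisy measurements
true_x = np.cumsum(0.1 * rng.standard_normal(50))
ys = true_x + 0.5 * rng.standard_normal(50)
est = particle_filter(ys)
```

The filtered estimates are substantially closer to the hidden trajectory than the raw measurements, which is precisely the benefit the tracking module is meant to deliver.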

3.5.4 Learning

Audition is a sophisticated, dynamic information-processing task performed in the human brain, which inevitably invokes other tasks almost simultaneously (such as action). Specifically, it is commonly believed that perception and action are mutually coupled and integrated via a sensorimotor feedback loop, as illustrated in fig. 3.4A. Indeed, it is this unique feature that enables humans to survive in a dynamic environment. For the same reason, it is our belief that an intelligent machine that aims at solving the CPP must embody a learning capability of a kind that empowers the machine to take action whenever changes in the environment call for it.

Figure 3.4 (A) The sensorimotor feedback loop consisting of three distinct functions: perceive, think, and act. (B) The interaction between the agent and the environment.

In the context of embodied intelligence, an autonomous agent is also supposed to conduct goal-oriented behavior during its interaction with the dynamic environment; hence, the necessity of taking action naturally arises. In other words, the agent has to continue to adapt itself (in terms of action or behavior) to maximize its (internal or external) reward, in order to achieve better perception of its environment (illustrated in fig. 3.4B). Such a problem naturally brings in the theory of reinforcement learning (Sutton and Barto, 1998). For example, imagine a maneuverable machine aimed at solving a computational CPP in a noisy room environment. The system then has to learn how to adjust its distance and the angle of its microphone array with respect to the attended audio sources (such as speech, music, etc.). To do so, the machine should have a built-in rewarding mechanism for interacting with the dynamic environment, and it has to gradually adapt its behavior to achieve a higher (internal and external) reward.7 To sum up, an autonomous intelligent machine self-operating in a dynamic environment will always need to conduct optimal action control or decision making. Since this problem bears much resemblance to the Markov decision process (MDP) (Bellman and Dreyfus, 1962), we may resort to the well-established theory of dynamic programming and reinforcement learning.8

3.6 Concluding Remarks

In this chapter, we have discussed the machine cocktail party problem and explored possible ways to solve it by means of an intelligent machine. To do so, we have briefly reviewed historical accounts of the cocktail party problem, as well as the important aspects of human and computational auditory scene analysis. More importantly, we have proposed a computational framework for active audition as an inherent part of an embodied cognitive machine. In particular, we highlighted the essential functions of active audition and discussed their possible implementation. The four functions identified under the active audition paradigm provide the basis for building an embodied cognitive machine that is capable of human-like hearing in an "active" fashion. The central tenet of active audition embodied in such a machine is that an observer may be able to understand an auditory environment more effectively and efficiently if it interacts with the environment than if it is a passive observer. In addition, in order to build a maneuverable intelligent machine (such as a robot), we also discussed the issue of integrating different sensory (auditory and visual) features, such that active vision and active audition can be combined in a single system to achieve active perception in the true sense.

Appendix: Reinforcement Learning

Mathematically, a Markov decision process9 is formulated as follows:

Definition 3.1 A Markov decision process (MDP) is defined as a 6-tuple (S, A, R, p_0, p_s, p_r), where

S is a (finite) set of (observable) environmental states (the state space), s ∈ S;
A is a (finite) set of actions (the action space), a ∈ A;
R is a (finite) set of possible rewards;
p_0 is an initial probability distribution over S, written as p_0(s_0);
p_s is a transition probability distribution over S conditioned on a value from S × A, also written as p^a_{ss'} or p_s(s_t | s_{t-1}, a_{t-1});10
p_r is a probability distribution over R conditioned on a value from S, written as p_r(r | s).

Definition 3.2 A policy is a mapping from states to probabilities of selecting each possible action. A policy, denoted by π, can be deterministic, π : S → A, or stochastic, π : S → P(A). An optimal policy π* is a policy that maximizes (minimizes) the expected total reward (cost) over time (within a finite or infinite horizon).

Given the above definitions, the goal of reinforcement learning is to find an optimal policy π*(s) for each s, which maximizes the expected reward received over time. We assume that the policy is stochastic; Q-learning (a special form of reinforcement learning) is aimed at learning in a stochastic world, in the sense that p_s and p_r are both nondeterministic. To evaluate reinforcement learning, a common measure of performance is the infinite-horizon discounted reward, which can be represented by the state-value function

V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r(s_t, π(s_t)) | s_0 = s ],   (3.5)

where 0 ≤ γ < 1 is a discount factor, and E_π is the expectation operator over the policy π. The value function V^π(s) defines the expected discounted reward at state s, as shown by

V^π(s) = E_π{R_t | s_t = s},   (3.6)

where R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ··· = r_{t+1} + γ(r_{t+2} + γ r_{t+3} + ···) = r_{t+1} + γ R_{t+1}. Similarly, one may define a state-action value function, the so-called Q-function,

Q^π(s, a) = E_π{R_t | s_t = s, a_t = a},   (3.7)

for which the goal is not only to achieve a maximal reward but also to find an optimal action (supposing multiple actions are accessible for each state). It can be shown that

V^π(s) = E_π{r(s, π(s))} + γ Σ_{s'∈S} p^π_{ss'} V^π(s'),

Q^π(s, a) = Σ_{s'∈S} p^a_{ss'} [R(s, a, s') + γ V^π(s')],

and

V^π(s) = Σ_{a∈A} π(s, a) Q^π(s, a) = Σ_{a∈A} π(s, a) [ r^a_s + γ Σ_{s'} p^a_{ss'} V^π(s') ],

which correspond to different forms of the Bellman equation (Bellman and Dreyfus, 1962). Note that if the state or action is continuous-valued, the summation operations are replaced by corresponding integration operations. The optimal value functions are then further defined as

V^{π*}(s) = max_π V^π(s),   Q^{π*}(s, a) = max_π Q^π(s, a).

Therefore, in light of dynamic programming theory (Bellman and Dreyfus, 1962), the optimal policy is deterministic and greedy with respect to the optimal value functions. Specifically, given the state s and the optimal policy π*, the optimal action is selected according to the formula

a* = arg max_{a∈A} Q^{π*}(s, a),

such that V^{π*}(s) = max_{a∈A} Q^{π*}(s, a).
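The optimal value function and the greedy policy above can be computed by value iteration, i.e., repeated application of the Bellman optimality operator. The two-state MDP below is hypothetical, its transition and reward numbers made up purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'] and R[a, s, s']
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.1, 0.9], [0.7, 0.3]]])   # transitions under action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.5], [1.0, 0.0]]])
gamma = 0.9

# Value iteration: V <- max_a sum_{s'} P[a,s,s'] (R[a,s,s'] + gamma V[s'])
V = np.zeros(2)
for _ in range(500):
    Q = np.einsum("ast,ast->as", P, R) + gamma * np.einsum("ast,t->as", P, V)
    V = Q.max(axis=0)

policy = Q.argmax(axis=0)   # greedy policy w.r.t. the optimal Q-function
```

After convergence, V satisfies the Bellman optimality equation to numerical precision, and `policy` is the deterministic greedy policy described in the text.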


A powerful reinforcement learning tool for tackling the above-formulated problem is the Q-learning algorithm (Sutton and Barto, 1998; Watkins, 1989). Classic Q-learning is an asynchronous, incremental, approximate dynamic programming method for stochastic optimal control. Unlike traditional dynamic programming, Q-learning is model-free in the sense that its operation requires knowledge of neither the state transition probabilities nor the environmental dynamics. In addition, Q-learning is computationally efficient and can be operated in an online manner (Sutton and Barto, 1998). For finite state and action sets, if each (s, a) pair is visited infinitely often and the step-size sequence used in Q-learning satisfies the standard stochastic-approximation conditions, then Q-learning is assured to converge to the optimal policy with probability 1 (Tsitsiklis, 1994; Watkins and Dayan, 1992). For problems with continuous states, function approximation methods can be used to tackle the generalization issue; see Bertsekas and Tsitsiklis (1996) and Sutton and Barto (1998) for detailed discussions.

Another model-free reinforcement learning algorithm is the actor-critic model (Sutton and Barto, 1998), which describes a bootstrapping strategy for reinforcement learning. Specifically, the actor-critic model has separate memory structures to represent the policy and the value function: the policy structure resides in the actor, which selects the actions; the value function is estimated by the critic, which criticizes the actions made by the actor. Learning is always on-policy, in that the critic uses a form of temporal-difference (TD) error to evaluate the current policy, while the actor uses the estimated value function from the critic to bootstrap itself toward a better policy.

A much more challenging but more realistic reinforcement learning problem is the so-called partially observable Markov decision process (POMDP).
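Before turning to the partially observable case, the tabular Q-learning update Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)] can be exercised on a toy problem (the five-state chain environment below is hypothetical, chosen only to show the model-free update at work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-state chain: action 1 moves right, action 0 moves left;
# only reaching the rightmost (goal) state pays a reward of 1.
n_states, n_actions, goal = 5, 2, 4

def step(s, a):
    s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == goal else 0.0
    return s_next, reward

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.3

for _ in range(500):                       # episodes, all starting at state 0
    s = 0
    while s != goal:
        # epsilon-greedy exploration
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r = step(s, a)
        # model-free TD update: no transition probabilities required
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

greedy = Q.argmax(axis=1)   # learned policy: move right in every nonterminal state
```

Note that the agent never consults the transition function when updating Q; it learns purely from sampled (s, a, r, s') transitions, which is the sense in which Q-learning is model-free.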
Unlike the MDP, which assumes full knowledge of the observable states, the POMDP addresses stochastic decision-making and optimal control problems with only partially observable states in the environment. In this case, the elegant Bellman equation does not hold, since it requires a completely observable Markovian environment (Kaelbling, 1993). The POMDP literature is extensive and ever growing; hence it is beyond the scope of the current chapter to expound on this problem. We refer the interested reader to Kaelbling et al. (1998), Lovejoy (1991), and Smallwood and Sondik (1973) for more details.

Acknowledgments

This chapter grew out of a review article (Haykin and Chen, 2005). In particular, we would like to thank our research colleagues R. Dong and S. Doclo for valuable feedback. The work reported here was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Notes

1. For tutorial treatments of the cocktail party problem, see Chen (2003), Haykin and Chen (2005), and Divenyi (2004).

2. The categorization of these three neural processes, done essentially for research-related studies, is somewhat artificial; the boundary between them is fuzzy, in that the brain does not necessarily distinguish between them as defined herein.

3. In Cherry's original paper, "intelligibility" refers to the probability of correctly identifying meaningful speech sounds; in contrast, "articulation" refers to a measure of nonsense speech sounds (Cherry, 1953).

4. As pointed out by Bajcsy (1988), the proposition that active sensing is an application of intelligent control theory may be traced to the PhD thesis of Tenenbaum (1970).

5. At McMaster University we are currently exploring the DSP hardware implementation of the fusion scheme depicted in fig. 3.2.

6. The "top-down" influence is particularly useful for (1) synthesizing missing information (e.g., the auditory "fill-in" phenomenon); (2) incorporating contextual priors and inputs from other sensory modalities; and (3) resolving perceptual ambiguities whenever lower-level information leads to confusion.

7. The reward can be a measure of speech intelligibility, the signal-to-interference ratio, or some sort of utility function.

8. Reinforcement learning is well known in the machine learning community but, regrettably, not so in the signal processing community. An appendix on reinforcement learning is included at the end of the chapter largely for the benefit of readers who may not be familiar with this learning paradigm.

9. For simplicity of exposition, we restrict our discussion to finite discrete state and action spaces, but the treatment also applies to the more general continuous state or action spaces.

10. In the case of finite discrete states, p_s constitutes a transition matrix.

4 Sensor Adaptive Signal Processing of Biological Nanotubes (Ion Channels) at Macroscopic and Nano Scales

Vikram Krishnamurthy

Ion channels are biological nanotubes formed by large protein molecules in the cell membrane. All electrical activities in the nervous system, including communication between cells and the influence of hormones and drugs on cell function, are regulated by ion channels. Therefore, understanding their mechanisms at a molecular level is a fundamental problem in biology. This chapter shows how dynamic stochastic models and associated statistical signal-processing techniques, together with novel learning-based stochastic control methods, can be used to understand the structure and dynamics of ion channels at both macroscopic and nanospatial scales. The unifying theme of this chapter is the concept of sensor adaptive signal processing, which deals with sensors dynamically adjusting their behavior so as to optimize their ability to extract signals from noise.

4.1 Introduction

All living cells are surrounded by a cell membrane composed of two layers of phospholipid molecules, called the lipid bilayer. Ion channels are biological nanotubes formed by protein macromolecules that facilitate the diffusion of ions across the cell membrane. Although we use the term biological nanotube, ion channels are typically on the scale of angstroms (10^{-10} m), i.e., an order of magnitude smaller in radius and length than the carbon nanotubes used in nanodevices. In the past few years, there have been enormous strides in our understanding of the structure-function relationships in biological ion channels. These advances have been brought about by the combined efforts of experimental and computational biophysicists, who together are beginning to unravel the working principles of these


exquisitely designed biological nanotubes that regulate the flow of charged particles across the cell membrane. The measurement of ionic currents flowing through single ion channels in cell membranes has been made possible by the gigaseal patch-clamp technique (Hamill et al., 1981; Neher and Sakmann, 1976). This was a major breakthrough, for which Neher and Sakmann won the 1991 Nobel Prize in Medicine. More recently, the 2003 Nobel Prize in Chemistry was awarded to MacKinnon for determining the structure of several different types of ion channels (including the bacterial potassium channel; Doyle et al. (1998)) from crystallographic analyses. Because all electrical activities in the nervous system, including communications between cells and the influence of hormones and drugs on cell function, are regulated by membrane ion channels, understanding their mechanisms at a molecular level is a fundamental problem in biology. Moreover, elucidation of how single ion channels work will ultimately help neurobiologists find the causes of, and possibly cures for, a number of neurological and muscular disorders. We refer the reader to the special issue of IEEE Transactions on NanoBioScience (Krishnamurthy et al., 2005) for an excellent up-to-date account of ion channels written by leading experts in the area. This chapter addresses two fundamental problems in ion channels from a statistical signal processing and stochastic control (optimization) perspective: the gating problem and the ion permeation problem. The gating problem (Krishnamurthy and Chung, 2003) deals with understanding how ion channels undergo structural changes to regulate the flow of ions into and out of a cell. Typically a gated ion channel has two states: a "closed" state, which does not allow ions to flow through, and an "open" state, which does. In the open state, the ion channel currents are typically of the order of pico-amps (i.e., 10−12 amps).
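The two-level open/closed gating picture described above can be sketched with a small simulation. The transition probabilities, sample count, and 2 pA open-state level below are illustrative choices, not values from the chapter:

```python
import random

def simulate_channel(n_samples, p_open=0.05, p_close=0.10,
                     current_open=2e-12, seed=0):
    """Simulate a two-state (closed/open) gated ion channel.

    p_open: probability of a closed -> open transition per sample (assumed)
    p_close: probability of an open -> closed transition per sample (assumed)
    current_open: open-state current in amps (order of pico-amps)
    """
    rng = random.Random(seed)
    state = 0  # start in the closed state
    trace = []
    for _ in range(n_samples):
        if state == 0 and rng.random() < p_open:
            state = 1
        elif state == 1 and rng.random() < p_close:
            state = 0
        trace.append(current_open if state else 0.0)
    return trace

trace = simulate_channel(10000)
open_fraction = sum(1 for i in trace if i > 0) / len(trace)
# stationary open fraction is roughly p_open / (p_open + p_close) = 1/3
```

The resulting trace is the kind of piecewise constant, two-level signal that the patch clamp measures, before thermal noise is added.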
The measured ion channel currents (obtained by sampling typically at 10 kHz, i.e., on a 0.1 millisecond time scale) are obfuscated by large amounts of thermal noise. In sections 4.2 and 4.3 of this chapter, we address the following issues related to the gating problem: (1) We present a hidden Markov model (HMM) formulation of the observed ion channel current. (2) We present in section 4.2 a discrete stochastic optimization algorithm for controlling a patch-clamp experiment to determine the Nernst potential of the ion channel with minimal effort; this fits into the class of so-called experimental design problems. (3) In section 4.3, we briefly discuss dynamic scheduling algorithms for activating multiple ion channels on a biological chip so as to extract maximal information from them. The permeation problem (Allen et al., 2003; O'Mara et al., 2003) seeks to explain the working of an ion channel at an Å (10−10 m) spatial scale by studying the propagation of individual ions through the ion channel at a femtosecond (10−15 s) time scale. This setup is said to be at a mesoscopic scale since the individual ions (e.g., Na+ ions) are of the order of a few Å in radius and are comparable in radius to the ion channel. At this mesoscopic level, point-charge approximations and continuum electrostatics break down. The discrete, finite nature of each ion


needs to be taken into consideration. Also, the failure of the mean field approximation in narrow channels implies that any theory that aspires to relate channel structure to function must treat ions explicitly. In sections 4.4, 4.5, and 4.6 of this chapter, we discuss the permeation problem for ion channels. We show how Brownian dynamics simulation can be used to model the propagation of individual ions. We also show how stochastic gradient learning-based schemes can be used to control the evolution of a Brownian dynamics simulation to predict the molecular structure of an ion channel. We refer the reader to our recent research (Krishnamurthy and Chung, a,b), where a detailed exposition of the resulting adaptive Brownian dynamics simulation algorithm is given. Furthermore, numerical results presented in Krishnamurthy and Chung (a,b) for antibiotic Gramicidin-A ion channels show that the estimates obtained from the adaptive Brownian dynamics algorithm are consistent with the known molecular structure of Gramicidin-A. An important underlying theme of this chapter is the ubiquitous nature of sensor adaptive signal processing. This transcends standard statistical signal processing, which deals with extracting signals from noisy observations, by examining the deeper problem of how to dynamically adapt the sensor to optimize the performance of the signal-processing algorithm. That is, the sensors dynamically modify their behavior to optimize their performance in extracting the underlying signal from noisy observations. A crucial aspect of sensor adaptive signal processing is feedback: past decisions about adapting the sensor affect future observations. Such sensor adaptive signal processing has recently been used in defense networks (Evans et al., 2001; Krishnamurthy, 2002, 2005) for scheduling sophisticated multimode sensors in unattended ground sensor networks, radar emission control, and adaptive radar beam allocation.
In this chapter we show how the powerful paradigm of sensor adaptive signal processing can be successfully applied to biological ion channels both at the macroscopic and nano scales.
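The feedback loop at the heart of sensor adaptive signal processing can be caricatured in a few lines: a sensor with two modes of different, initially unknown noise variances adapts its mode choice based on its own past observations. The modes, variances, and greedy rule below are illustrative assumptions, not part of the chapter's algorithms:

```python
import random

def adaptive_sensing(n_steps=2000, true_signal=1.0, noise_std=(2.0, 0.5), seed=1):
    """Greedy sensor adaptation sketch: estimate each mode's noise power from
    its own past observations and prefer the mode that currently looks less
    noisy. Past mode choices determine which observations are available,
    which is exactly the feedback aspect described in the text."""
    rng = random.Random(seed)
    obs = {0: [], 1: []}
    for k in range(n_steps):
        if k < 20:
            mode = k % 2  # forced exploration of both modes
        else:
            # empirical noise power of each mode (the feedback step)
            var = {m: sum((y - true_signal) ** 2 for y in obs[m]) / len(obs[m])
                   for m in (0, 1)}
            mode = min(var, key=var.get)
        obs[mode].append(true_signal + rng.gauss(0.0, noise_std[mode]))
    return {m: len(obs[m]) for m in (0, 1)}

counts = adaptive_sensing()
# the low-noise mode (mode 1) ends up being used far more often
```

The point of the sketch is only the loop structure: observe, estimate, adapt the sensor, observe again.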

4.2

The Gating Problem and Estimating the Nernst Potential of Ion Channels

In this section we first outline the well-known hidden Markov model (HMM) for the ion channel current in the gating problem. We refer the reader to the paper by Krishnamurthy and Chung (2003) for a detailed exposition. Estimating the underlying ion channel current from the noisy HMM observations is a well-studied problem in HMM signal processing (Ephraim and Merhav, 2002; James et al., 1996; Krishnamurthy and Yin, 2002). In this section, consistent with the theme of sensor adaptive signal processing, we address the deeper issue of how to dynamically control the behavior of the ion channels so as to extract maximal information about them. In particular, we propose two novel applications of stochastic control for adapting the behavior of the ion channel. Such ideas are also relevant in other applications, such as sensor scheduling in defense networks (Krishnamurthy, 2002, 2005).


4.2.1

Hidden Markov Model Formulation of Ion Channel Current

The patch clamp is a device for isolating the ion channel current from a single ion channel. A typical trace of the ion channel current measurement from a patch-clamp experiment (after suitable anti-aliasing filtering and sampling) shows that the channel current is a piecewise constant discrete-time signal that randomly jumps between two values: zero amperes, which denotes the closed state of the channel, and I(θ) amperes (typically a few pico-amperes), which denotes the open state. I(θ) is called the open-state current level. Sometimes the current recorded from a single ion channel dwells at one or more intermediate levels, known as conductance substates. Chung et al. (1990, 1991) first introduced the powerful paradigm of hidden Markov models (HMMs) to characterize patch-clamp recordings of small ion channel currents contaminated by random and deterministic noise. Using sophisticated HMM signal-processing methods, Chung et al. (1990, 1991) demonstrated that the underlying parameters of the HMM could be obtained with remarkable precision despite the extremely poor signal-to-noise ratio. These HMM parameter estimates yield important information about the dynamics of ion channels. Since the publications of Chung et al. (1990, 1991), several papers have appeared in the neurobiological community that generalize these HMM signal models in various ways to model measurements of ion channels (see the paper of Venkataramanan et al. (2000) and the references therein). With these HMM techniques, it is now possible for neurobiologists to analyze not only large ion channel currents but also small conductance fluctuations occurring in noise.

Markov Model for Ion Channel Current
Suppose a patch-clamp experiment is conducted with a voltage θ applied across the ion channel. Then, as described in Chung et al. (1991) and in Venkataramanan et al. (2000), the ion channel current {i_n(θ)} can be modeled as a three-state homogeneous first-order Markov chain.
The state space of this Markov chain is {0_g, 0_b, I(θ)}, corresponding to the physical states of gap mode, burst-mode-closed, and burst-mode-open. For convenience, we will refer to the burst-mode-closed and burst-mode-open states as the closed and open states, respectively. In the gap mode and the closed state, the ion channel current is zero; in the open state, the ion channel current has the value I(θ). The 3 × 3 transition probability matrix A(θ) of the Markov chain {I_n(θ)}, which governs the probabilistic behavior of the channel current, is given by

                 0_g        0_b        I(θ)
A(θ) = 0_g  [ a_11(θ)   a_12(θ)   0       ]
       0_b  [ a_21(θ)   a_22(θ)   a_23(θ) ]                      (4.1)
       I(θ) [ 0         a_32(θ)   a_33(θ) ]

The elements of A(θ) are the transition probabilities a_ij(θ) = P(I_{n+1}(θ) = j | I_n(θ) = i), where i, j ∈ {0_g, 0_b, I(θ)}. The zero probabilities in the above


matrix A(θ) reflect the fact that an ion channel current cannot jump directly from the gap mode to the open state; similarly, an ion channel current cannot jump from the open state to the gap mode. Note that in general the applied voltage θ affects both the transition probabilities and the state levels of the ion channel current {I_n(θ)}.

Hidden Markov Model (HMM) Observations
Let {y_n(θ)} denote the measured noisy ion channel current at the electrode when conducting a patch-clamp experiment:

y_n(θ) = i_n(θ) + w_n(θ),    n = 1, 2, . . .                      (4.2)

Here {w_n(θ)} is thermal noise, modeled as zero-mean white Gaussian noise with variance σ^2(θ). Thus the observation process {y_n(θ)} is a hidden Markov model (HMM) sequence parameterized by the model

λ(θ) = {A(θ), I(θ), σ^2(θ)},                                      (4.3)

where θ denotes the applied voltage. We remark here that the formulation extends trivially to observation models in which the noise process w_n(θ) includes a time-varying deterministic component together with white noise; only the HMM parameter estimation algorithm needs to be modified, as in Krishnamurthy et al. (1993).

HMM Parameter Estimation of Current Level I(θ)
Given the HMM model for the ion channel current above, estimating I(θ) for a fixed voltage θ involves processing the noisy observations {y_n(θ)} through an HMM maximum likelihood parameter estimator. The most popular way of computing the maximum likelihood estimate (MLE) of I(θ) is via the expectation maximization (EM) algorithm (the Baum-Welch equations). The EM algorithm is an iterative algorithm for computing the MLE; it is now fairly standard in the signal-processing and neurobiology literature (see Ephraim and Merhav (2002) for a recent exposition, or Chung et al. (1991), which is aimed at neurobiologists). Let Î_Δ(θ) denote the MLE of I(θ) based on the Δ-point measured channel current sequence (y_1(θ), . . . , y_Δ(θ)). For sufficiently large batch size Δ, by the asymptotic normality of the MLE for an HMM (Bickel et al., 1998),

√Δ (Î_Δ(θ) − I(θ)) ∼ N(0, Σ(θ)),                                  (4.4)

where Σ^{−1}(θ) is the Fisher information matrix. Thus asymptotically Î_Δ(θ) is an unbiased estimator of I(θ), i.e., E{Î_Δ(θ)} = I(θ), where E{·} denotes the mathematical expectation operator.

4.2.2

Nernst Potential and Discrete Stochastic Optimization for Ion Channels

To record currents from single ion channels, the tip of an electrode, with a diameter of about 1 μm, is pushed against the surface of a cell, and then a tight


seal is formed between the rim of the electrode tip and the cell membrane. A patch of the membrane surrounded by the electrode tip usually contains one or more single ion channels. The current flowing from the inside of the cell to the tip of the electrode through a single ion channel is monitored. This is known as the cell-attached configuration of the patch-clamp technique for measuring currents through a single ion channel. Figure 4.1 shows the schematic setup of the cell in electrolyte and the electrode pushed against the surface of the cell.

Figure 4.1 Cell-attached patch experimental setup: an electrode (applied potential E_o, ionic concentration c_o) is pressed against a cell (resting potential E_i, intracellular concentration c_i).

In a living cell, there is a potential difference between its interior and the outside environment, known as the membrane potential. Typically, the cell interior is about 60 mV more negative with respect to the outside. Also, the ionic concentrations (mainly Na+, Cl−, and K+) inside a cell are very different from those outside the cell. In the cell-attached configuration, the ionic strength in the electrode is usually made the same as that outside the cell. Let E_i and E_o, respectively, denote the resting membrane potential and the potential applied to the electrode. If E_o is identical to the membrane potential, there will be no potential gradient across the membrane patch confined by the tip of the electrode. Let c_i denote the intracellular ionic concentration and c_o the ionic concentration in the electrode. The intracellular concentration c_i is unknown, as is the resting membrane potential E_i; c_o and E_o are set by the experimenter and are known. Let θ = E_o − E_i denote the potential gradient. Both the potential gradient θ and the concentration gradient c_o − c_i drive ions across an ion channel, resulting in an ion channel current {I_n(θ)}. This ion channel current is a piecewise constant signal that jumps between the values zero and I(θ), where I(θ) denotes the current when the ion channel is in the open state. The potential E_o (and hence the potential difference θ) is adjusted experimentally until the current I(θ) goes to zero. The voltage θ* at which the current I(θ*) vanishes is called the Nernst potential and satisfies the so-called Nernst equation

θ* = −(kT/e) ln(c_o/c_i) = −59 log_10(c_o/c_i) (mV),              (4.5)


where e = 1.6 × 10−19 C denotes the charge of an electron, k denotes Boltzmann's constant, and T denotes the absolute temperature. The Nernst equation (4.5) gives the potential difference θ required to maintain electrochemical equilibrium when the concentrations are different on the two faces of the membrane. Estimating the Nernst potential θ* requires conducting experiments at different values of the voltage θ. In patch-clamp experiments, the applied voltage θ is usually chosen from a finite set. Let θ ∈ Θ = {θ(1), . . . , θ(M)} denote the finite set of possible voltage values from which the experimenter can pick. For example, in typical experiments, if one needs to determine the Nernst potential to a resolution of 4 mV, then M = 80 and the θ(i) are uniformly spaced in 4 mV steps from θ(1) = −160 mV to θ(M) = 160 mV. Note that the Nernst potential θ* (the zero-crossing point) does not necessarily belong to the discrete set Θ; instead we will find the point in Θ that is closest to θ* (with resolution θ(2) − θ(1)). With a slight abuse of notation we will denote the element of Θ closest to the Nernst potential as θ*. Thus determining θ* ∈ Θ can be formulated as a discrete optimization problem:

θ* = arg min_{θ∈Θ} |I(θ)|^2.
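As a quick numerical check of the Nernst equation (4.5), the sketch below evaluates θ* = −(kT/e) ln(c_o/c_i) directly. The function name and the room-temperature default are our own choices:

```python
import math

K_BOLTZMANN = 1.380649e-23   # Boltzmann's constant k, J/K
E_CHARGE = 1.602176634e-19   # electron charge e, C

def nernst_mv(c_out, c_in, temp_kelvin=296.0):
    """Nernst potential theta* = -(kT/e) ln(c_out/c_in), in millivolts (eq. 4.5)."""
    return -(K_BOLTZMANN * temp_kelvin / E_CHARGE) * math.log(c_out / c_in) * 1e3

# a tenfold concentration gradient gives roughly -59 mV, consistent with
# the -59 log10(c_out/c_in) mV form quoted in the chapter
theta = nernst_mv(10.0, 1.0)
```

At room temperature kT/e ≈ 25.5 mV, and multiplying by ln 10 ≈ 2.30 recovers the familiar 59 mV-per-decade slope.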


Discrete Stochastic Approximation Algorithm
Learning the Nernst potential can be formulated as the following discrete stochastic optimization problem:

Compute θ* = arg min_{θ∈Θ} E{ |Î(θ)|^2 },                          (4.6)

where Î(θ) is the MLE of the parameter I(θ) of the HMM. Since for an HMM no closed-form expression is available for Σ^{−1}(θ) in equation 4.4, the above expectation cannot be evaluated analytically. This motivates the need to develop a simulation-based (stochastic approximation) algorithm. We refer the reader to the paper by Krishnamurthy and Chung (2003) for details. The idea of discrete stochastic approximation (Andradottir, 1999) is to design a plan of experiments which provides more observations in areas where the Nernst potential is expected and fewer in other areas. More precisely, what is needed is a dynamic resource allocation (control) algorithm that dynamically controls (schedules) the choice of voltage at which the HMM estimator operates, in order to efficiently obtain the zero point and deduce how the current increases or decreases as the applied voltage deviates from the Nernst potential. We propose a discrete stochastic approximation algorithm that is both consistent and attracted to the Nernst potential. That is, the algorithm should spend more time gathering observations {y_n(θ)} at the Nernst potential θ = θ* and less time at other values of θ ∈ Θ. Thus in discrete stochastic approximation the aim is to devise an efficient (Pflug, 1996, chapter 5.3) adaptive search (sampling plan) which allows finding the minimizer θ* with as few samples as possible by not making unnecessary observations at nonpromising values of θ. Here we construct algorithms based on the random search procedures of Andradottir (1995, 1999). The basic idea is to generate a homogeneous Markov chain taking values in Θ which spends more time at the global optimum than at any other element of Θ. We will show that these algorithms can be modified for tracking time-varying Nernst potentials. Finally, it is worthwhile mentioning that there are other classes of simulation-based discrete stochastic optimization algorithms, such as nested partition methods (Swisher et al., 2000), which combine partitioning, random sampling, and backtracking to create a Markov chain that converges to the global optimum.

Let n = 1, 2, . . . denote discrete time. The proposed algorithm is recursive and requires conducting experiments on batches of data, so it is convenient to introduce the following notation. Group the discrete time into batches of length Δ (typically Δ = 10,000 in experiments), and use the index N = 1, 2, . . . to denote the batch number. Thus batch N comprises the Δ discrete time instants n ∈ {NΔ, NΔ + 1, . . . , (N + 1)Δ − 1}. Let D_N = (D_N(1), . . . , D_N(M)) denote the vector of duration times the algorithm spends at the M possible potential values in Θ. Finally, for notational convenience define the M-dimensional unit vectors e_m, m = 1, . . . , M, as

e_m = (0, . . . , 0, 1, 0, . . . , 0)',                            (4.7)

with 1 in the m-th position and zeros elsewhere. The discrete stochastic approximation algorithm of Andradottir (1995) is not directly applicable to the cost function 4.6, since it applies to optimization problems of the form min_{θ∈Θ} E{C(θ)}. However, equation 4.6 can easily be converted to this form as follows: let Î_1(θ), Î_2(θ) be two statistically independent unbiased HMM estimates of I(θ).
Then, defining Ĉ(θ) = Î_1(θ) Î_2(θ), it straightforwardly follows that

E{Ĉ(θ)} = |E{Î(θ)}|^2 = |I(θ)|^2.                                  (4.8)

The discrete stochastic approximation algorithm we propose is as follows:

Algorithm 4.1 Algorithm for Learning the Nernst Potential
Step 0 (Initialization): At batch-time N = 0, select the starting point X_0 ∈ {1, . . . , M} randomly. Set D_0 = e_{X_0}, and set the initial solution estimate θ̂*_0 = θ(X_0).
Step 1 (Sampling): At batch-time N, sample X̃_N ∈ {X_N − 1, X_N + 1} with uniform distribution.
Step 2 (Evaluation and acceptance): Apply voltage θ̃ = θ(X̃_N) to the patch-clamp experiment and obtain two Δ-length batches of HMM observations. Let Î^(1)_N(θ̃) and Î^(2)_N(θ̃) denote the HMM-MLE estimates for these two batches, computed using the EM algorithm (James et al., 1996; Krishnamurthy and Chung, 2003). Set Ĉ_N(θ̃) = Î^(1)_N(θ̃) Î^(2)_N(θ̃). Then apply voltage θ = θ(X_N), and similarly compute the HMM-MLE estimates Î^(1)_N(θ) and Î^(2)_N(θ) for two batches; set Ĉ_N(θ) = Î^(1)_N(θ) Î^(2)_N(θ). If Ĉ_N(θ̃) < Ĉ_N(θ), set X_{N+1} = X̃_N; else, set X_{N+1} = X_N.
Step 3 (Update occupation probabilities of X_N): D_{N+1} = D_N + e_{X_{N+1}}.
Step 4 (Update estimate of Nernst potential): θ̂*_N = θ(m*), where m* = arg max_{m ∈ {1,...,M}} D_{N+1}(m). Set N → N + 1 and go to step 1.

The proof of convergence of the algorithm is given in theorem 4.1 below. The main idea behind the above algorithm is that the sequence {X_N} (or equivalently {θ(X_N)}) generated by steps 1 and 2 is a homogeneous Markov chain with state space {1, . . . , M} (respectively, Θ) that is designed to spend more time at the global optimizer θ* than at any other state. In the above algorithm, θ̂*_N denotes the estimate of the Nernst potential at batch N.

Interpretation of Step 3 as a Decreasing Step Size Adaptive Filtering Algorithm
Define the occupation probability estimate vector as π̂_N = D_N/N. Then the update in step 3 can be reexpressed as

π̂_{N+1} = π̂_N + μ_{N+1} (e_{X_{N+1}} − π̂_N),   π̂_0 = e_{X_0},    (4.9)

which is merely an adaptive filtering algorithm for updating π̂_N with decreasing step size μ_N = 1/N. Hence algorithm 4.1 can be viewed as a decreasing step size algorithm which involves a least mean squares (LMS) algorithm (with decreasing step size) in tandem with a random search and evaluation step (steps 1 and 2) for generating X_N. Figure 4.2 shows a schematic diagram of the algorithm with this LMS interpretation of step 3. In Andradottir (1995), the following stochastic ordering assumption was used for the convergence of algorithm 4.1:

(O) For any m ∈ {1, . . . , M − 1},

I^2(θ(m + 1)) > I^2(θ(m)) ⟹ P( Ĉ(θ(m + 1)) > Ĉ(θ(m)) ) > 0.5,
I^2(θ(m + 1)) < I^2(θ(m)) ⟹ P( Ĉ(θ(m + 1)) > Ĉ(θ(m)) ) < 0.5.

Figure 4.2 Schematic of algorithm 4.1. Step 1 samples X̃_N from {X_N − 1, X_N + 1}; step 2 runs the patch-clamp experiment at voltages θ(X_N) and θ(X̃_N) and evaluates the HMM-MLE costs Ĉ_N(θ̃) and Ĉ_N(θ); step 3 is an adaptive filter with step size μ_N that updates π̂_N; a final maximization yields the estimate θ̂*_N.


Theorem 4.1 Under condition (O) above, the sequence {θ(X_N)} generated by algorithm 4.1 is a homogeneous, aperiodic, irreducible Markov chain with state space Θ. Furthermore, algorithm 4.1 is attracted to the Nernst potential θ*, i.e., for sufficiently large N the sequence {θ(X_N)} spends more time at θ* than at any other state. (Equivalently, if θ(m*) = θ*, then D_N(m*) > D_N(j) for j ∈ {1, . . . , M} − {m*}.)

The above discrete stochastic approximation algorithm can be viewed as the discrete analog of the well-known LMS algorithm. Recall that in the LMS algorithm, the new estimate is computed from the previous estimate by moving along a desirable search direction (based on gradient information). In complete analogy, in the above discrete search algorithm the new estimate is obtained by moving along a discrete search direction to a desirable new point. We refer the reader to Krishnamurthy and Chung (2003) and our recent papers (Krishnamurthy et al., 2004; Yin et al., 2004) for complete convergence details of the above discrete stochastic approximation algorithm.
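The steps of algorithm 4.1 can be sketched as follows. A synthetic noisy quadratic cost stands in for the two-batch HMM-MLE evaluation of |Î(θ)|², and the grid, noise level, and zero crossing at −60 mV are illustrative assumptions, not values from the chapter:

```python
import random

def learn_nernst(levels, noisy_cost, n_batches=5000, seed=2):
    """Discrete stochastic approximation in the spirit of algorithm 4.1
    (Andradottir-style random search). `levels` is the voltage grid Theta;
    `noisy_cost(m, rng)` returns an independent noisy sample of the cost
    |I(theta(m))|^2 at grid index m."""
    rng = random.Random(seed)
    M = len(levels)
    x = rng.randrange(M)                       # step 0: random start
    counts = [0] * M                           # occupation times D_N
    counts[x] = 1
    for _ in range(n_batches):
        # step 1: sample a neighbouring grid point uniformly
        cand = min(max(x + rng.choice((-1, 1)), 0), M - 1)
        # step 2: fresh noisy evaluations; accept if the candidate looks better
        if noisy_cost(cand, rng) < noisy_cost(x, rng):
            x = cand
        # step 3: update occupation counts
        counts[x] += 1
    # step 4: the estimate is the most-visited voltage level
    return levels[max(range(M), key=counts.__getitem__)]

# illustrative setup: I(theta) crosses zero at theta = -60 mV
levels = list(range(-160, 161, 4))             # 4 mV grid
cost = lambda m, rng: (levels[m] + 60) ** 2 + rng.gauss(0.0, 50.0)
estimate = learn_nernst(levels, cost)          # typically lands near -60 mV
```

Because each comparison uses fresh independent noisy samples, the chain satisfies a stochastic ordering of the kind required by condition (O), and the occupation counts concentrate around the minimizer.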

4.3

Scheduling Multiple Ion Channels on a Biological Chip

In this section, we consider dynamic scheduling and control of the gating process of ion channels on a biological chip. Patch clamping has rapidly become the "gold standard" (Fertig et al., 2002) for the study of the dynamics of ion channel function by neurobiologists. However, patch clamping is a laborious process requiring precision micromanipulation under high-power visual magnification, vibration damping, and an experienced, skillful experimenter. Because of this, high-throughput studies required in proteomics and drug development have to rely on less valuable methods such as fluorescence-based measurement of intracellular ion concentrations (Xu et al., 2001). There is thus significant interest in an automated version of the whole patch-clamp principle, preferably one that has the potential to be used in parallel on a number of cells. In 2002, Fertig et al. made a remarkable invention: the first successful demonstration of a patch clamp on a chip, a planar quartz-based biological chip that consists of several hundred ion channels (Fertig et al., 2002; Sigworth and Klemic, 2002). This patch-clamp chip can be used for massively parallel screens of ion channel activity, thereby providing a high-throughput screening tool for drug discovery efforts. Typically, because of their high cost, most neurobiological laboratories have only one patch-clamp amplifier that can be connected to the patch-clamp chip; as a result, only one ion channel on the chip can be monitored at a given time. It is thus of significant interest to devise an adaptive scheduling strategy that dynamically decides which single ion channel to activate at each time instant in order to maximize the throughput (information) from the patch-clamp experiment. Such a scheduling strategy will enable rapid evaluation and screening of drugs. Note that this problem directly fits into our main theme of sensor adaptive signal


Figure 4.3 One-dimensional section of the planar biological chip. Each well contains a cell membrane patch with an ion channel in an electrolyte solution containing caged glutamate; a laser beam activates a selected channel, and a stochastic scheduler switches the amplifier from well to well.

processing. Here we consider the problem of how to dynamically schedule the activation of individual ion channels using a laser beam so as to maximize the information obtained from the patch-clamp chip for high-throughput drug evaluation. We refer the reader to Krishnamurthy (2004) for a detailed exposition of the problem together with numerical studies. The ion channel activation scheduling algorithm needs to dynamically plan and react to the uncertain (random) dynamics of the individual ion channels on the chip. Moreover, excessive use of a single ion channel can desensitize it. The aim is to answer the following question: How should the ion channel activation scheduler dynamically decide which ion channel on the patch-clamp chip to activate at each time instant in order to minimize the overall desensitization of channels while simultaneously extracting maximum information from the channels? We refer the reader to Fertig et al. (2002) for details on the synthesis of a patch-clamp chip. The chip consists of a quartz substrate of 200 micrometers thickness that is perforated by wet etching techniques, resulting in apertures with diameters of approximately 1 micrometer. The apertures replace the tips of the glass pipettes commonly used for patch-clamp recording. Cells are positioned onto the apertures from suspension by the application of suction. A schematic illustration of the ion channel scheduling problem for the patch-clamp chip is given in fig. 4.3. The figure shows a cross section of the chip with 4 ion channels. The planar chip could, for example, consist of 50 rows each containing 4 ion channels. Each of the four wells contains a membrane patch with an ion channel. The external electrolyte solutions contain caged ligands (such as caged glutamate). When a laser beam is directed at a well, the inert caged ligands become


active ligands that cause a channel to go from the closed conformation to an open conformation. Ions then flow across the open channel, and the current generated by the motion of charged particles is monitored with a patch-clamp amplifier. The amplifier is switched electronically from the output of one well to another. Typically, the magnitude of the current across each channel, when it is open, is about 1 pA (10−12 A). The design of the ion channel activation scheduling algorithm needs to take into account the following subsystems.

Heterogeneous ion channels (macromolecules) on the chip: In a patch-clamp chip, the dynamical behavior of the individual ion channels that are activated changes with time, since they can become desensitized through excessive use. Desensitized ion channels behave quite differently from other ion channels: their transitions to the open state become less frequent.

Patch-clamp amplifier and heterogeneous measurements: The channel current of the activated ion channel is of the order of pico-amps and is measured in large amounts of thermal noise. Chung et al. (1990, 1991) used the powerful paradigm of HMMs to characterize these noisy measurements of single ion channel currents. The added complexity in the patch-clamp chip is that the signal-to-noise ratio differs across the chip, meaning that certain ion channels have higher SNR than others.

Ion channel activation scheduler: The ion channel activation scheduler uses the noisy channel current observations of the activated ion channel to decide which ion channel to activate at the next time instant so as to maximize a reward function that comprises the information obtained from the experiment. It needs to avoid activating desensitized channels, as they yield less information.

4.3.1

Stochastic Dynamical Models for Ion Channels on Patch-Clamp Chip

In this section we formulate a novel Markov chain model for the ion channels that takes into account both the ion channel current state and the ion channel sensitivity. The patch-clamp chip consists of P ion channels arranged in a two-dimensional grid indexed by p = 1, . . . , P. Let k = 0, 1, 2, . . . denote discrete time. At each time instant k the scheduler decides which single ion channel to activate by directing a laser beam at the ion channel as described above. Let u_k ∈ {1, . . . , P} denote the ion channel that is activated by the scheduler at time k; the remaining P − 1 ion channels on the chip are inactive. It is the job of the dynamic scheduler to decide which ion channel should be activated at each time instant k in order to maximize the amount of information that can be obtained from the chip. If channel p is active at time k, i.e., u_k = p, the following two mechanisms determine the evolution of this active ion channel.

Ion Channel Sensitivity Model
The longer the channel is activated, the more probable it is that it becomes desensitized. Let d^(p)_k ∈ {normal, de-sens} denote the sensitivity of ion channel p at time instant k. If ion channel p is activated at time k, i.e., u_k = p, then d^(p)_k can be modeled as a two-state Markov chain with state transition probability matrix

             normal    de-sens
D = normal  [ d_11     1 − d_11 ]      0 ≤ d_11 ≤ 1.              (4.10)
    de-sens [ 0        1        ]
The above transition probabilities reflect the fact that if the channel is overused, it becomes desensitized with probability d_12 = 1 − d_11. The 0 and 1 in the second row imply that once the channel is desensitized, it remains desensitized. Note that the sensitivity of the inactive channels remains fixed, i.e., d^(q)_{k+1} = d^(q)_k for q ≠ p.

Ion Channel Current Model
Suppose channel p is active at time k, i.e., u_k = p. Let i^(p)_k ∈ {0, I} = {closed, open} denote the channel current. As is well known (Chung et al., 1991), the channel current is a binary-valued signal that switches between zero (the "closed state") and the current level I (the "open state"). The open-state current level I is of importance to neurobiologists since it quantifies the effect of a drug on the ion channel. Moreover, i^(p)_k can be modeled as a two-state Markov chain (Chung et al., 1991) conditional on d^(p)_{k+1}, with transition probability matrices Q(d^(p)_{k+1}) = (P(i^(p)_{k+1} | i^(p)_k, d^(p)_{k+1})) given by

                                closed   open
Q(d^(p)_{k+1} = normal)  = closed [ q_11    q_12 ]
                           open   [ q_21    q_22 ]
                                                                  (4.11)
Q(d^(p)_{k+1} = de-sens) = closed [ q̄_11    q̄_12 ]
                           open   [ q̄_21    q̄_22 ]

For each ion channel p ∈ {1, ..., P} on the patch-clamp chip, define the ion channel state as the vector Markov process s_k^{(p)} = (d_k^{(p)}, i_k^{(p)}) with state space {(normal, closed), (normal, open), (de-sens, closed), (de-sens, open)}, where for notational convenience we map the four states to {1, 2, 3, 4}. It is clear that only the state s_k^{(u_k)} of the activated ion channel evolves with time. Since

    P(s_{k+1}^{(p)} \mid s_k^{(p)}) = P(d_{k+1}^{(p)} \mid d_k^{(p)}) \, P(i_{k+1}^{(p)} \mid i_k^{(p)}, d_{k+1}^{(p)}),

if channel p is active at time k, i.e., u_k = p, then s_k^{(p)} has transition probability matrix

    A^{(p)} = \begin{pmatrix}
    d_{11} q_{11} & d_{11} q_{12} & (1 - d_{11}) \bar q_{11} & (1 - d_{11}) \bar q_{12} \\
    d_{11} q_{21} & d_{11} q_{22} & (1 - d_{11}) \bar q_{21} & (1 - d_{11}) \bar q_{22} \\
    0 & 0 & \bar q_{11} & \bar q_{12} \\
    0 & 0 & \bar q_{21} & \bar q_{22}
    \end{pmatrix}.    (4.12)
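The composite matrix of eq. 4.12 can be assembled mechanically from D (eq. 4.10) and the two conditional current chains (eq. 4.11). The sketch below does this in plain Python; the numeric values of d11, Q, and Q-bar are illustrative assumptions, not values from the text.

```python
# Sketch: assemble the 4-state transition matrix A (eq. 4.12) from the
# sensitivity chain D (eq. 4.10) and the conditional current chains Q, Q-bar
# (eq. 4.11). Numeric values below are illustrative, not from the text.

def build_A(d11, Q, Qbar):
    """States: 0=(normal,closed), 1=(normal,open), 2=(de-sens,closed),
    3=(de-sens,open).  P(s'|s) = P(d'|d) * P(i'|i, d')."""
    D = [[d11, 1.0 - d11],    # normal -> normal / de-sens
         [0.0, 1.0]]          # de-sens is absorbing
    chains = {0: Q, 1: Qbar}  # current chain selected by the *next* sensitivity d'
    A = [[0.0] * 4 for _ in range(4)]
    for s in range(4):
        d, i = divmod(s, 2)   # d: 0=normal, 1=de-sens; i: 0=closed, 1=open
        for s2 in range(4):
            d2, i2 = divmod(s2, 2)
            A[s][s2] = D[d][d2] * chains[d2][i][i2]
    return A

Q    = [[0.9, 0.1], [0.2, 0.8]]     # illustrative values
Qbar = [[0.95, 0.05], [0.5, 0.5]]
A = build_A(0.98, Q, Qbar)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A)  # rows are probabilities
assert A[2][0] == 0.0 and A[3][1] == 0.0  # de-sens never returns to normal
```

The zero lower-left block reproduces the absorbing second row of D: once a channel desensitizes, its current chain is governed by Q-bar alone.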

Sensor Adaptive Signal Processing of Biological Nanotubes

More generally, one can assume that the state s_k^{(p)} of each ion channel p takes values in a finite set of N_p states (instead of just four). If u_k = p, the state s_k^{(p)} of ion channel p evolves according to an N_p-state homogeneous Markov chain with transition probability matrix

    A^{(p)} = (a_{ij}^{(p)}), \qquad a_{ij}^{(p)} = P(s_{k+1}^{(p)} = j \mid s_k^{(p)} = i), \quad i, j \in \{1, \ldots, N_p\},    (4.13)

if ion channel p is active at time k. The states of all the other P − 1 ion channels that are not activated are unaffected, i.e., s_{k+1}^{(q)} = s_k^{(q)}, q ≠ p. To complete our probabilistic formulation, assume the initial states of all ion channels on the chip are initialized with prior distributions: s_0^{(p)} ~ x_0^{(p)}, where the x_0^{(p)} are specified initial distributions for p = 1, ..., P. The above formulation captures the essence of an activation-controlled patch-clamp chip: the channel activation scheduler dynamically decides which single ion channel to activate at each time instant.

4.3.2  Patch-Clamp Amplifier and Hidden Markov Model Measurements

The state s_k^{(p)} of the active ion channel on the chip is not directly observed. Instead, the output of the patch-clamp amplifier is the ion channel current i_k^{(p)} observed in large amounts of thermal noise. This output is quantized to an M-symbol alphabet y_k^{(p)} ∈ {O_1, O_2, ..., O_M}. The probabilistic relationship between the observations y_k^{(p)} and the actual ion channel state s_k^{(p)} of the active ion channel p is summarized by the (N_p × M) state likelihood matrix

    B^{(p)} = (b_{im}^{(p)}), \qquad i \in \{1, \ldots, N_p\}, \; m \in \{1, \ldots, M\},    (4.14)
where b_{im}^{(p)} = P(y_{k+1}^{(p)} = O_m | s_{k+1}^{(p)} = i, u_k = p) denotes the conditional probability (symbol probability) of the observation symbol y_{k+1}^{(p)} = O_m when the actual state is s_{k+1}^{(p)} = i and the active ion channel is u_k = p. Note that the above model allows the state likelihood probabilities (b_{im}^{(p)}) to vary with p, i.e., with the spatial location of the ion channel on the patch-clamp chip, thus allowing for spatially heterogeneous measurement statistics. Let Y_k = (y_1^{(u_0)}, ..., y_k^{(u_{k-1})}) denote the observed history up to time k, and let U_k = (u_0, ..., u_k) denote the sequence of past decisions made by the ion channel activation scheduler regarding which ion channels to activate from time 0 to time k.
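The measurement model of eq. 4.14 amounts to drawing a symbol from the row of B^{(p)} selected by the current state. A minimal sampling sketch follows; the two-state, three-symbol likelihood rows are illustrative assumptions, not parameters from the text.

```python
import random

# Sketch: draw one quantized patch-clamp output symbol (eq. 4.14).
# B[i][m] = P(y = O_m | s = i) for the active channel.

def sample_observation(B, state, rng):
    """Inverse-CDF sampling of a symbol index m from row B[state]."""
    u = rng.random()
    acc = 0.0
    for m, p in enumerate(B[state]):
        acc += p
        if u < acc:
            return m
    return len(B[state]) - 1  # guard against floating-point round-off

B = [[0.8, 0.15, 0.05],   # closed-like states emit mostly low-current symbols
     [0.1, 0.30, 0.60]]   # open-like states emit mostly high-current symbols
rng = random.Random(0)
ys = [sample_observation(B, 1, rng) for _ in range(2000)]
assert 0.55 < ys.count(2) / len(ys) < 0.65  # empirical frequency near 0.60
```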

4.3.3  Ion Channel Activation Scheduler

The above probabilistic model for the ion channel, together with the noisy measurements from the patch-clamp amplifier, constitutes a well-known type of dynamic Bayesian network called a hidden Markov model (HMM) (Ephraim and Merhav, 2002). The problem of state inference for an HMM, i.e., estimating the state s_k^{(p)} given (Y_k, U_k), has been widely studied (see, e.g., Chung et al. (1991); Ephraim and Merhav (2002)). In this chapter we address the deeper and more fundamental issue of how the ion channel activation scheduler should dynamically decide which ion channel to activate at each time instant in order to minimize a suitable cost function that encompasses all the ion channels. Such dynamic decision making under uncertainty (noisy channel current measurements) transcends standard sensor-level HMM state inference, which is a well-studied problem (Chung et al., 1991). The activation scheduler decides which ion channel to activate at time k based on the optimization of a discounted cost function, which we now detail. The instantaneous cost incurred at time k due to all the ion channels (both active and inactive) is

    C_k = -c_0(u_k) + c_1(s_k^{(u_k)}, u_k) + \sum_{p \ne u_k} r(s_k^{(p)}, p),    (4.15)

where −c_0(u_k) + c_1(s_k^{(u_k)}, u_k) denotes the cost incurred by the active ion channel u_k, and \sum_{p \ne u_k} r(s_k^{(p)}, p) denotes the cost of the remaining P − 1 inactive ion channels. The three components in the cost function 4.15 can be chosen by the neurobiologist experimenter to optimize the information obtained from the patch-clamp experiment. Here we present one possible choice of costs:

Ion channel quality of service (QoS): c_0(p) denotes the quality of service of the active ion channel p. The minus sign in equation 4.15 reflects the fact that the lower the QoS, the higher the cost, and vice versa.

State information cost: The final outcome of the patch-clamp experiment is often the estimate of the open-state current level I. The accuracy of this estimate increases linearly with the number of observations obtained in the open state (since the error covariance of the estimate decreases linearly with the data length according to the central limit theorem). Maximizing the accuracy of I requires maximizing the utilization of the patch-clamp chip, i.e., maximizing the expected number of measurements made from ion channels that are in the open normal state. That is, preference should be given to activating ion channels that are normal (i.e., not desensitized) and that switch to the open state more quickly than other ion channels.

Desensitization cost of inactive channels: The instantaneous cost r(s_k^{(p)}, p) in equation 4.15 incurred by each of the P − 1 inactive ion channels p ∈ {1, 2, ..., P} − {u_k} should be chosen so as to penalize desensitized channels.

Based on the observed history Y_k = (y_1^{(u_0)}, ..., y_k^{(u_{k-1})}) and the history of decisions U_{k-1} = (u_0, ..., u_{k-1}), the scheduler decides which ion channel on the chip to activate at time k according to a stationary policy μ : (Y_k, U_{k-1}) → u_k.
Here μ is a function that maps the observation history Y_k and past decisions U_{k-1} to the choice of which ion channel u_k to activate at time k. Let U denote the class of admissible stationary policies, i.e., U = {μ : u_k = μ(Y_k, U_{k-1})}. The total expected discounted cost over an infinite time horizon is given by

    J_\mu = E\left\{ \sum_{k=0}^{\infty} \beta^k C_k \right\},    (4.16)

where C_k is defined in equation 4.15, β ∈ [0, 1) denotes the discount factor, and E{·} denotes mathematical expectation. The aim of the scheduler is to determine the optimal stationary policy μ* ∈ U which minimizes the cost in equation 4.16. This problem of minimizing the infinite-horizon discounted cost 4.16 of the stochastic dynamical system 4.13 with noisy observations (equation 4.14) is a partially observed Markov decision process (POMDP). Developing numerically efficient ion channel activation scheduling algorithms to minimize this cost is the subject of the rest of this section.
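The discounted cost of eq. 4.16 can be estimated by truncated Monte Carlo rollouts, since β^k decays geometrically. The sketch below uses hypothetical stand-ins for the cost function and policy; the constant-cost case gives the closed-form sanity check c/(1 − β).

```python
import random

# Sketch: Monte Carlo estimate of the discounted cost J_mu (eq. 4.16).
# cost_fn and policy are placeholder callables; the infinite horizon is
# truncated once beta**k is negligible.

def discounted_cost(cost_fn, policy, beta, horizon, n_runs, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        run = 0.0
        for k in range(horizon):
            u = policy(k, rng)                 # which channel to activate
            run += (beta ** k) * cost_fn(u, rng)
        total += run
    return total / n_runs

# Constant cost c per step gives J = c / (1 - beta), a useful sanity check.
J = discounted_cost(lambda u, rng: 1.0, lambda k, rng: 0,
                    beta=0.9, horizon=400, n_runs=1)
assert abs(J - 1.0 / (1 - 0.9)) < 1e-10
```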

4.3.4  Formulation of Activation Scheduling as a Multiarmed Bandit

The above stochastic control problem (eq. 4.16) is an infinite-horizon partially observed Markov decision process with a multiarmed bandit structure, which considerably simplifies the solution. But first, as is standard with partially observed stochastic control problems, we convert the partially observed multiarmed bandit problem to a fully observed multiarmed bandit problem defined in terms of the information state (Bertsekas, 1995a).

4.3.5  Information State Formulation

For each ion channel p, the information state at time k, which we denote by x_k^{(p)} (a column vector of dimension N_p), is defined as the conditional filtered density of the Markov chain state s_k^{(p)} given Y_k and U_{k-1}:

    x_k^{(p)}(i) = P(s_k^{(p)} = i \mid Y_k, U_{k-1}), \qquad i = 1, \ldots, N_p.    (4.17)

The information state can be computed recursively by the HMM state filter, also known as the forward algorithm or Baum's algorithm (James et al., 1996), according to equation 4.18 below. In terms of the information state, the ion channel activation scheduling problem described above can be viewed as the following dynamic scheduling problem. Consider P parallel HMM state filters, one for each ion channel on the chip. The pth HMM filter computes the state estimate (filtered density) x_k^{(p)} of the pth ion channel, p ∈ {1, ..., P}. At each time instant, only one of the P ion channels is active, say ion channel p, resulting in an observation y_{k+1}^{(p)}. This is processed by the pth HMM state filter, which updates its Bayesian estimate of the ion channel's state as

    x_{k+1}^{(p)} = \frac{B^{(p)}(y_{k+1}^{(p)}) \, A^{(p)\prime} x_k^{(p)}}{\mathbf{1}' B^{(p)}(y_{k+1}^{(p)}) \, A^{(p)\prime} x_k^{(p)}} \qquad \text{if ion channel } p \text{ is active},    (4.18)

where, if y_{k+1}^{(p)} = O_m, then B^{(p)}(m) = diag[b_{1m}^{(p)}, ..., b_{N_p m}^{(p)}] is the diagonal matrix formed by the mth column of the observation matrix B^{(p)}, and 1 is an N_p-dimensional column vector of ones (we use ′ to denote transpose). The state estimates of the other P − 1 HMM state filters remain unaffected, i.e., if ion channel q is inactive,

    x_{k+1}^{(q)} = x_k^{(q)}, \qquad q \in \{1, \ldots, P\}, \; q \ne p.    (4.19)
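One step of the filter recursion of eq. 4.18 can be sketched in a few lines of pure Python; the matrices A and B below are illustrative two-state placeholders, not parameters from the text.

```python
# Sketch: one step of the HMM filter (eq. 4.18) for the active channel,
#   x_{k+1} = B(y) A' x_k / (1' B(y) A' x_k).
# A[i][j] = P(s'=j | s=i), so the prediction step uses the transpose A'.

def hmm_filter_step(x, A, B, y):
    n = len(x)
    pred = [sum(A[i][j] * x[i] for i in range(n)) for j in range(n)]  # A' x
    unnorm = [B[j][y] * pred[j] for j in range(n)]                    # B(y) A' x
    s = sum(unnorm)
    return [v / s for v in unnorm]                                    # normalize

A = [[0.95, 0.05], [0.10, 0.90]]    # closed <-> open transitions
B = [[0.8, 0.2], [0.3, 0.7]]        # P(symbol | state)
x = [0.5, 0.5]
x = hmm_filter_step(x, A, B, y=1)   # observe a "high-current" symbol
assert abs(sum(x) - 1.0) < 1e-12
assert x[1] > 0.5                   # belief shifts toward the open state
```

The frozen update of eq. 4.19 needs no code: the inactive channels' belief vectors are simply carried over unchanged.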

Let X^{(p)} denote the state space of information states x^{(p)} for ion channel p ∈ {1, 2, ..., P}. That is,

    X^{(p)} = \left\{ x^{(p)} \in \mathbb{R}^{N_p} : \mathbf{1}' x^{(p)} = 1, \; 0 < x^{(p)}(i) < 1 \text{ for all } i \in \{1, \ldots, N_p\} \right\}.    (4.20)

Note that X^{(p)} is an (N_p − 1)-dimensional simplex. Using the smoothing property of conditional expectations, the cost function 4.16 can be rewritten in terms of the information state as

    J_\mu = E\left\{ \sum_{k=0}^{\infty} \beta^k \left[ c'(u_k) \, x_k^{(u_k)} + \sum_{p \ne u_k} r'(p) \, x_k^{(p)} \right] \right\},    (4.21)

where c(u_k) denotes the N_{u_k}-dimensional cost vector with elements c(i, u_k) = −c_0(u_k) + c_1(s_k^{(u_k)} = i, u_k), i = 1, ..., N_{u_k}, and r(p) denotes the N_p-dimensional cost vector with elements r(s_k^{(p)} = i, p), i = 1, ..., N_p. The aim is to compute the optimal policy arg min_{μ∈U} J_μ. In terms of equations 4.18 and 4.21, the multiarmed bandit problem reads thus: design an optimal dynamic scheduling policy to choose which ion channel to activate, and hence which HMM Bayesian state estimator to use, at each time instant. As it stands, the POMDP problem of equations 4.18, 4.19, and 4.21, or equivalently that of equations 4.16, 4.13, and 4.14, has a special structure: (1) Only one Bayesian HMM state estimator operates according to 4.18 at each time k, or equivalently, only one ion channel is active at a given time k. The remaining P − 1 Bayesian estimates x_k^{(q)} remain frozen, or equivalently, the remaining P − 1 ion channels remain inactive. (2) The active ion channel incurs a cost depending on its current state and QoS. Since the state estimates of the inactive ion channels are frozen, the cost incurred by them is a fixed constant depending on the state when they were last active. These two properties imply that equations 4.18, 4.19, and 4.21 constitute what Gittins (1989) terms an ongoing multiarmed bandit. It turns out that by a straightforward transformation an ongoing bandit can be formulated as a standard multiarmed bandit. It is well known that the multiarmed bandit problem has a rich structure, which results in the ion channel activation scheduling problem decoupling into P independent optimization problems. Indeed, from the theory of multiarmed bandits it follows that the optimal scheduling policy has an indexable rule (Whittle, 1980): for each channel p there is a function γ^{(p)}(x_k^{(p)}), called the Gittins index, which is a function only of the ion channel p and its information state x_k^{(p)}, whereby the

optimal ion channel activation policy at time k is to activate the ion channel with the largest Gittins index, i.e.,

    \text{activate ion channel } q, \quad \text{where } q = \arg\max_{p \in \{1, \ldots, P\}} \gamma^{(p)}(x_k^{(p)}).    (4.22)

A proof of this index rule for general multiarmed bandit problems is given by Whittle (1980). Computing the Gittins index is a key requirement for devising an optimal activation policy for the patch-clamp chip. We refer the reader to Krishnamurthy (2004) for details on how the Gittins index is computed and for numerical examples of the performance of the algorithm.

Remarks: The indexable structure of the optimal ion channel activation policy (eq. 4.22) is convenient for two reasons: (1) Scalability: Since the Gittins index is computed for each ion channel independently of every other ion channel (and this computation is off-line), the ion channel activation problem is easily scalable, in that we can handle several hundred ion channels on a chip. In contrast, without taking the multiarmed bandit structure into account, the POMDP has N_p^P underlying states, making it computationally impossible to solve; e.g., for P = 50 channels with N_p = 2 states per channel, there are 2^50 states! (2) Suitability for heterogeneous ion channels: Our formulation allows the ion channels to have different transition probabilities and likelihood probabilities. Moreover, since the Gittins index of an ion channel does not depend on other ion channels, we can meaningfully compare different types of ion channels.
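The index rule of eq. 4.22 reduces scheduling to a per-channel comparison, which is what makes it scalable. The sketch below uses a hypothetical stand-in for the Gittins index; a real index would be precomputed off-line per channel, as in Krishnamurthy (2004).

```python
# Sketch: the index rule (eq. 4.22) with a placeholder index function.

def activate(beliefs, gamma):
    """beliefs[p] is channel p's information state; pick the largest index."""
    return max(range(len(beliefs)), key=lambda p: gamma(p, beliefs[p]))

# Hypothetical index: the belief mass on state (normal, open), i.e. the
# channel most likely to yield an informative open-state measurement.
gamma = lambda p, x: x[1]

beliefs = [
    [0.70, 0.10, 0.15, 0.05],   # channel 0: probably normal but closed
    [0.20, 0.60, 0.10, 0.10],   # channel 1: probably normal and open
    [0.05, 0.05, 0.45, 0.45],   # channel 2: probably desensitized
]
assert activate(beliefs, gamma) == 1
```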

4.4  The Permeation Problem: Brownian Stochastic Dynamical Formulation

In the previous section we dealt with ion channels at a macroscopic level, in both the spatial and time scales. The permeation problem considered in this and the following two sections seeks to explain the working of an ion channel at an angstrom (1 Å = 10^-10 m) spatial scale by studying the propagation of individual ions through the ion channel on a femtosecond (10^-15 s) timescale. This setup is said to be at a mesoscopic scale, since the individual ions (e.g., Na+ ions) are of the order of a few Å in radius and are comparable in radius to the ion channel. At this mesoscopic level, point-charge approximations and continuum electrostatics break down, and the discrete, finite nature of each ion needs to be taken into consideration. Also, the failure of the mean-field approximation in narrow channels implies that any theory that aspires to relate channel structure to its function must treat ions explicitly. For convenience, we focus in this section primarily on gramicidin-A channels, which are among the simplest ion channels. Gramicidin-A is an antibiotic produced by Bacillus brevis. It was one of the first antibiotics to be isolated, in the 1940s (Finkelstein, 1987, p. 130). In submicromolar concentrations it can increase the


conductance of a bacterial cell membrane (a planar lipid bilayer membrane) by more than seven orders of magnitude through the formation of cation-selective channels. As a result the bacterial cell is flooded and dies. This property of dramatically increasing the conductance of a lipid bilayer membrane has recently been exploited by Cornell et al. (1997) to devise gramicidin-A channel based biosensors with extremely high gains. The aim of this section and the following two sections is to develop a stochastic dynamical formulation of the permeation problem that ultimately leads to estimating a potential of mean force (PMF) profile for an ion channel by optimizing the fit between the simulated current and the experimentally observed current. In the mesoscopic simulation of an ion channel, we propagate each individual ion using Brownian dynamics (the Langevin equation), and the force experienced by each ion is a function of the PMF. As a result of the PMF and the externally applied potential, ions drift from outside to inside the cell via the ion channel, yielding the simulated current. Determining the PMF profile that optimizes the fit between the mesoscopic simulated current and the observed current yields useful information and insight into how an ion channel works at a mesoscopic level. Determining the optimal PMF profile is important for several reasons. First, it yields the effective charge density in the peptides that form the ion channel; this charge density yields insight into the crystal structure of the peptide. Second, for theoretical biophysicists, the PMF profile yields information about the permeation dynamics, including where an ion is likely to be trapped (the binding sites), the mean velocity of propagation of ions through the channel, and the average conductance of the ion channel.
We refer the reader to Krishnamurthy and Chung (a,b) for complete details of the Brownian dynamics algorithm and of adaptively controlled Brownian dynamics algorithms for estimating the PMF of ion channels. The tutorial paper by Krishnamurthy and Chung (2005) and references therein also give a detailed overview of Brownian dynamics simulation for determining the structure of ion channels.

4.4.1  Levels of Abstraction for Modeling Ion Channels at the Nanoscale

The ultimate aim of theoretical biophysicists is to provide a comprehensive physical description of biological ion channels. At the lowest level of abstraction is the ab initio quantum mechanical approach, in which the interactions between the atoms are determined from first-principles electronic structure calculations. Due to the extremely demanding nature of the computations, its applications are at present limited to very small systems. A higher level of modeling abstraction is classical molecular dynamics, in which simulations are carried out using empirically determined pairwise interaction potentials between the atoms via ordinary differential equations (Newton's equations of motion). However, it is not computationally feasible to simulate the ion channel long enough to see permeation of ions across a model channel. For that purpose, one has to go up one further step in abstraction to

stochastic dynamics, of which Brownian dynamics (BD) is the simplest form. Here the water molecules that form the bulk of the system are stochastically averaged and only the ions themselves are explicitly simulated: instead of considering the dynamics of individual water molecules, one considers their average effect as a random force, or Brownian motion, acting on the ions. This treatment of the water molecules can be viewed as a functional central limit theorem approximation. In BD, it is further assumed that the protein is rigid. Thus, in BD, the motion of each individual ion is modeled by a stochastic differential equation known as the Langevin equation. A still higher level of abstraction is the Poisson-Nernst-Planck (PNP) theory, which is based on the continuum hypothesis of electrostatics and the mean-field approximation. Here, ions are treated not as discrete entities but as continuous charge densities that represent the space-time average of the microscopic motion of ions. For narrow ion channels, where continuum electrostatics does not hold, the PNP theory does not adequately explain ion permeation.

Remark: Bio-Nanotube Ion Channel vs. Carbon Nanotube  There has recently been much work in the nanotechnology literature on carbon nanotubes and their use in field-effect transistors (FETs). BD ion channel models are more complex than those of a carbon nanotube. Biological ion channels have radii of between 2 Å and 6 Å. In these narrow conduits formed by the protein wall, the force impinging on a permeating ion from induced surface charges on the water-protein interface becomes a significant factor. This force is insignificant in the carbon nanotubes used in FETs, whose radius of approximately 100 Å is large compared with the Debye length of electrons or holes in Si.
Thus the key difference is that while in carbon nanotubes point-charge approximations and continuum electrostatics hold, in ion channels the discrete, finite nature of each ion needs to be considered.

4.4.2  Brownian Dynamics (BD) Simulation Setup

Figure 4.4 illustrates the schematic setup of a Brownian dynamics simulation of ion permeation through an ion channel. The aim is to obtain structural information, i.e., to determine the channel geometry and the charges in the protein that forms the ion channel. Figure 4.4 shows a schematic illustration of a BD simulation assembly for a particular example, an antibiotic ion channel called the gramicidin-A ion channel. The ion channel is placed at the center of the assembly, and the atoms forming the ion channel are represented as a homogeneous medium with a dielectric constant of 2. A large reservoir with a fixed number of positive ions (e.g., K+ or Na+ ions) and negative ions (e.g., Cl− ions) is attached at each end of the ion channel. The electrolyte in the two reservoirs comprises 55 M (molar) H2O and 150 mM concentrations of Na+ and Cl− ions.
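As a rough order-of-magnitude check on this setup, a 150 mM electrolyte in a nanometre-scale reservoir contains only a handful of ions of each species, which is why BD can afford to track them individually. The cylindrical reservoir dimensions below are illustrative assumptions, not values from the text.

```python
import math

# Sketch: number of Na+ ions implied by a 150 mM concentration in a small
# cylindrical reservoir (radius and height of 30 A are assumed, not from text).

N_A = 6.022e23          # Avogadro's number, per mole
conc = 0.150            # mol per litre (150 mM)
radius_m = 30e-10       # 30 angstrom in metres
height_m = 30e-10

volume_m3 = math.pi * radius_m ** 2 * height_m
volume_L = volume_m3 * 1000.0            # 1 m^3 = 1000 L
n_ions = conc * N_A * volume_L

# A reservoir this size holds fewer than a dozen Na+ ions.
assert 5 < n_ions < 15
```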

Figure 4.4  Gramicidin-A ion channel model comprising 2N ions within two cylindrical reservoirs R1 and R2, connected by the ion channel C.

4.4.3  Mesoscopic Permeation Model of Ion Channel

Our permeation model for the ion channel comprises two cylindrical reservoirs R1 and R2 connected by the ion channel C, as depicted in fig. 4.4, into which 2N ions are inserted (N denotes a positive even integer). In fig. 4.4, as an example, we have chosen the gramicidin-A antibiotic ion channel, although the results below hold for any ion channel. These 2N ions comprise:

(1) N positively charged ions indexed by i = 1, 2, ..., N. Of these, the N/2 ions indexed by i = 1, 2, ..., N/2 are in R1, and the N/2 ions indexed by i = N/2 + 1, ..., N are in R2. Each Na+ ion has charge q^{(i)} = q^+, mass m^{(i)} = m^+ = 3.8 × 10^-26 kg, frictional coefficient m^+ γ^+, and radius r^+.

(2) N negatively charged ions indexed by i = N + 1, N + 2, ..., 2N. Of these, the N/2 ions indexed by i = N + 1, ..., 3N/2 are placed in R1 and the remaining N/2 ions indexed by i = 3N/2 + 1, ..., 2N are placed in R2. Each negative ion has charge q^{(i)} = q^-, mass m^{(i)} = m^-, frictional coefficient m^- γ^-, and radius r^-.

R = R1 ∪ R2 ∪ C denotes the set comprising the interior of the reservoirs and the ion channel. Let t ≥ 0 denote continuous time. Each ion i moves in three-dimensional space over time. Let x_t^{(i)} = (x_t^{(i)}, y_t^{(i)}, z_t^{(i)})′ ∈ R and v_t^{(i)} ∈ ℝ^3 denote the position and velocity of ion i at time t; the three components of x_t^{(i)} are, respectively, the x, y, and z position coordinates. An external potential Φ_λ^{ext}(x) is applied along the z-axis of fig. 4.4, i.e., with x = (x, y, z)′, Φ_λ^{ext}(x) = λz, λ ∈ Λ. Here Λ denotes a finite set of applied field strengths, typically Λ = {−200, −180, ..., 0, ..., 180, 200} mV/m. Due to this applied external potential, the Na+ ions drift from reservoir R1 to R2 via the ion channel C in fig. 4.4. Let X_t = (x_t^{(1)}, x_t^{(2)}, ..., x_t^{(2N)}) ∈ R^{2N} denote the positions, and V_t = (v_t^{(1)}, v_t^{(2)}, ..., v_t^{(2N)}) ∈ ℝ^{6N} the velocities, of all 2N ions.
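The index bookkeeping above can be checked mechanically. This small sketch verifies that the four reservoir groups partition the index set {1, ..., 2N}:

```python
# Sketch: the ion indexing of section 4.4.3 for N positive and N negative
# ions (N even); the four groups must partition {1, ..., 2N}.

def ion_groups(N):
    assert N % 2 == 0
    return {
        "pos_R1": set(range(1, N // 2 + 1)),              # i = 1 .. N/2
        "pos_R2": set(range(N // 2 + 1, N + 1)),          # i = N/2+1 .. N
        "neg_R1": set(range(N + 1, 3 * N // 2 + 1)),      # i = N+1 .. 3N/2
        "neg_R2": set(range(3 * N // 2 + 1, 2 * N + 1)),  # i = 3N/2+1 .. 2N
    }

g = ion_groups(8)
all_ions = set().union(*g.values())
assert all_ions == set(range(1, 17))             # covers {1, ..., 2N}
assert sum(len(s) for s in g.values()) == 16     # groups are disjoint
```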
The position and velocity of each individual ion evolve according to the following continuous-time stochastic dynamical system:

    x_t^{(i)} = x_0^{(i)} + \int_0^t v_s^{(i)} \, ds,    (4.23)

    m^+ v_t^{(i)} = m^+ v_0^{(i)} - \int_0^t m^+ \gamma^+(x_s^{(i)}) \, v_s^{(i)} \, ds + \int_0^t F_{\theta,\lambda}^{(i)}(X_s) \, ds + b^+ w_t^{(i)}, \qquad i \in \{1, 2, \ldots, N\},    (4.24)

    m^- v_t^{(i)} = m^- v_0^{(i)} - \int_0^t m^- \gamma^-(x_s^{(i)}) \, v_s^{(i)} \, ds + \int_0^t F_{\theta,\lambda}^{(i)}(X_s) \, ds + b^- w_t^{(i)}, \qquad i \in \{N+1, N+2, \ldots, 2N\}.    (4.25)

Equations 4.24 and 4.25 constitute the well-known Langevin equations and describe the evolution of the velocity v_t^{(i)} of ion i as a stochastic dynamical system. The random process {w_t^{(i)}} denotes a three-dimensional Brownian motion that is component-wise independent. The constants b^+ and b^- satisfy, respectively, (b^+)^2 = 2 m^+ γ^+ kT and (b^-)^2 = 2 m^- γ^- kT. Finally, the noise processes {w_t^{(i)}} and {w_t^{(j)}} that drive any two different ions, j ≠ i, are assumed to be statistically independent.
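In a simulation, equations 4.23-4.25 are advanced with a discrete-time scheme such as Euler-Maruyama. The sketch below integrates a single one-dimensional ion with the systematic force omitted and nondimensional illustrative parameters; it is a toy discretization, not the BD algorithm of the text. The sanity check is that the stationary velocity variance approaches kT/m, as the fluctuation-dissipation relation requires.

```python
import math
import random

# Sketch: Euler-Maruyama discretization of the Langevin equation (eq. 4.24)
# for one 1-D ion with zero systematic force F. Parameters are nondimensional
# and illustrative; in BD the time step is on the order of femtoseconds.

def simulate_velocity(m, gamma, kT, dt, n_steps, rng):
    b = math.sqrt(2.0 * m * gamma * kT)       # fluctuation-dissipation relation
    v = 0.0
    vs = []
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))    # Brownian increment over dt
        v += -gamma * v * dt + (b / m) * dw   # systematic force F/m omitted
        vs.append(v)
    return vs

rng = random.Random(1)
vs = simulate_velocity(m=1.0, gamma=1.0, kT=1.0, dt=0.01, n_steps=200000, rng=rng)
var = sum(v * v for v in vs) / len(vs)
assert 0.8 < var < 1.2    # stationary variance should approach kT/m = 1
```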

In equations 4.24 and 4.25, F_{\theta,\lambda}^{(i)}(X_t) = -q^{(i)} \nabla_{x_t^{(i)}} \Phi_{\theta,\lambda}^{(i)}(X_t) represents the systematic force acting on ion i, where the scalar-valued process \Phi_{\theta,\lambda}^{(i)}(X_t) is the total electric potential experienced by ion i given the positions X_t of the 2N ions. The subscript λ is the applied external potential. The subscript θ is a parameter that characterizes the potential of mean force (PMF) profile, which is an important component of \Phi_{\theta,\lambda}^{(i)}(X_t). It is convenient to represent the above system (equations 4.23, 4.24, and 4.25) as a vector stochastic differential equation. Define the following vector-valued variables:

    \zeta_t = \begin{pmatrix} X_t \\ V_t \end{pmatrix}, \quad
    V_t = \begin{pmatrix} V_t^+ \\ V_t^- \end{pmatrix}, \quad
    V_t^+ = \begin{pmatrix} v_t^{(1)} \\ \vdots \\ v_t^{(N)} \end{pmatrix}, \quad
    V_t^- = \begin{pmatrix} v_t^{(N+1)} \\ \vdots \\ v_t^{(2N)} \end{pmatrix}, \quad
    w_t = \begin{pmatrix} w_t^{(1)} \\ \vdots \\ w_t^{(2N)} \end{pmatrix},

    F_{\theta,\lambda}^+(X_t) = \begin{pmatrix} F_{\theta,\lambda}^{(1)}(X_t) \\ \vdots \\ F_{\theta,\lambda}^{(N)}(X_t) \end{pmatrix}, \quad
    F_{\theta,\lambda}^-(X_t) = \begin{pmatrix} F_{\theta,\lambda}^{(N+1)}(X_t) \\ \vdots \\ F_{\theta,\lambda}^{(2N)}(X_t) \end{pmatrix}, \quad
    F_{\theta,\lambda}(X_t) = \begin{pmatrix} \frac{1}{m^+} F_{\theta,\lambda}^+(X_t) \\ \frac{1}{m^-} F_{\theta,\lambda}^-(X_t) \end{pmatrix}.    (4.26)

Then equations 4.23, 4.24, and 4.25 can be written compactly as

    d\zeta_t = A \zeta_t \, dt + f_{\theta,\lambda}(\zeta_t) \, dt + \Sigma^{1/2} \, dw_t,    (4.27)

where

    \Sigma^{1/2} = \mathrm{blockdiag}\!\left( 0_{6N \times 6N}, \; \frac{b^+}{m^+} I_{3N \times 3N}, \; \frac{b^-}{m^-} I_{3N \times 3N} \right),

    A = \begin{pmatrix} 0_{6N \times 6N} & I_{6N \times 6N} \\ 0_{6N \times 6N} & \mathrm{blockdiag}(-\gamma^+ I_{3N \times 3N}, \, -\gamma^- I_{3N \times 3N}) \end{pmatrix}, \qquad
    f_{\theta,\lambda}(\zeta_t) = \begin{pmatrix} 0_{6N \times 1} \\ F_{\theta,\lambda}(X_t) \end{pmatrix}.    (4.28)
We will subsequently refer to equations 4.27 and 4.28 as the Brownian dynamics equations for the ion channel.

Remark: The BD approach is a stochastic averaging framework that models the average effect of the water molecules:
1. The friction term m γ v_t^{(i)} dt captures the average effect of the ions, driven by the applied external electrical field, bumping into water molecules every few femtoseconds. The frictional coefficient is given by Einstein's relation.
2. The Brownian motion term w_t^{(i)} captures the random motion of ions bumping into water molecules and is given by the fluctuation-dissipation theorem.
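For a sense of scale, the noise amplitude implied by the fluctuation-dissipation relation can be evaluated directly. Only the Na+ mass m+ below is from the text; the friction coefficient γ+ and temperature are assumed order-of-magnitude values.

```python
import math

# Sketch: the noise scaling b+ from the fluctuation-dissipation relation
# (b+)^2 = 2 m+ gamma+ kT.  m+ = 3.8e-26 kg is from the text; gamma+ is an
# assumed order of magnitude for an ion in water, not a value from the text.

k_B = 1.381e-23        # Boltzmann constant, J/K
T = 298.0              # K, assumed room temperature
m_pos = 3.8e-26        # kg (Na+ mass, from the text)
gamma_pos = 1.0e13     # 1/s, assumed friction coefficient

b_pos = math.sqrt(2.0 * m_pos * gamma_pos * k_B * T)
assert 1e-17 < b_pos < 1e-16   # sanity check on the order of magnitude
```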


4.4.4  Systematic Force Acting on Ions

As mentioned after equation 4.25, the systematic force experienced by ion i is

    F_{\theta,\lambda}^{(i)}(X_t) = -q^{(i)} \nabla_{x_t^{(i)}} \Phi_{\theta,\lambda}^{(i)}(X_t),

where the scalar-valued process \Phi_{\theta,\lambda}^{(i)}(X_t) denotes the total electric potential experienced by ion i given the positions X_t of all 2N ions. We now give a detailed formulation of these systematic forces. The potential \Phi_{\theta,\lambda}^{(i)}(X_t) experienced by each ion i comprises the following five components:

    \Phi_{\theta,\lambda}^{(i)}(X_t) = U_\theta(x_t^{(i)}) + \Phi_\lambda^{ext}(x_t^{(i)}) + \Phi^{IW}(x_t^{(i)}) + \Phi^{C,i}(X_t) + \Phi^{SR,i}(X_t).    (4.29)

Just as \Phi_{\theta,\lambda}^{(i)}(X_t) is decomposed into five terms, the force F_{\theta,\lambda}^{(i)}(X_t) = -q^{(i)} \nabla_{x_t^{(i)}} \Phi_{\theta,\lambda}^{(i)}(X_t) experienced by ion i can similarly be decomposed as the superposition (vector sum) of five force terms, each due to the corresponding potential in equation 4.29; for notational simplicity, however, we describe the scalar-valued potentials rather than the vector-valued forces. Note that the first three terms in equation 4.29, namely U_\theta(x_t^{(i)}), \Phi_\lambda^{ext}(x_t^{(i)}), and \Phi^{IW}(x_t^{(i)}), depend only on the position x_t^{(i)} of ion i, whereas the last two terms, \Phi^{C,i}(X_t) and \Phi^{SR,i}(X_t), depend on the distances from ion i to all the other ions, i.e., on the positions X_t of all the ions. The five components in equation 4.29 are now defined.

Potential of mean force (PMF): The PMF, denoted U_\theta(x_t^{(i)}) in equation 4.29, comprises the electric forces acting on ion i when it is in or near the ion channel (nanotube C in fig. 4.4). The PMF U_\theta is a smooth function of the ion position x_t^{(i)} and depends on the structure of the ion channel; estimating U_\theta(\cdot) therefore yields structural information about the ion channel. In section 4.6, we outline an adaptive Brownian dynamics approach to estimating the PMF U_\theta(\cdot). The PMF originates from two different sources; see Krishnamurthy and Chung (2005) for details. First, there are fixed charges in the channel protein, and the electric field emanating from them renders the pore attractive to cations and repulsive to anions, or vice versa. Some of the amino acids forming the ion channels carry unit or partial electronic charges. For example, glutamate and aspartate are acidic amino acids, negatively charged at pH 6.0, whereas lysine, arginine, and histidine are basic amino acids, positively charged at pH 6.0. Second, when any of the ions in the assembly comes near the protein wall, it induces surface charges of the same polarity at the water-protein interface. This is known as the induced surface charge.

External applied potential: In the vicinity of the cell, there is a strong electric field resulting from the membrane potential, which is generated by diffuse, unpaired ionic clouds on each side of the membrane. Typically, this resting potential across a cell membrane, whose thickness is about 50 Å, is 70 mV, with the cell interior negative with respect to the extracellular space. In simulations, this field is mimicked by applying a uniform electric field across the channel. This is equivalent to placing a pair of large plates far away from the channel and applying a potential difference between the two plates. Because the space between the electrodes is filled with electrolyte solutions, each reservoir is isopotential; that is, the average potential anywhere in a reservoir is identical to the applied potential at the plate on that side. For ion i at position x_t^{(i)} = x = (x, y, z)′, \Phi_\lambda^{ext}(x) = \lambda z denotes the potential on ion i due to the applied external field. The electric field acting on each ion due to the applied potential is therefore -\nabla_{x_t^{(i)}} \Phi_\lambda^{ext}(x) = (0, 0, -\lambda) V/m at all x ∈ R. It is this applied external field that causes a drift of ions from reservoir R1 to R2 via the ion channel C. As a result of this drift of ions within the electrolyte in the two reservoirs, eventually the measured potential drop across the reservoirs is zero and all of the potential drop occurs across the ion channel.

Inter-ion Coulomb potential: In equation 4.29, \Phi^{C,i}(X_t) denotes the Coulomb interaction between ion i and all the other ions:

    \Phi^{C,i}(X_t) = \frac{1}{4 \pi \epsilon_0 \epsilon_w} \sum_{j=1,\, j \ne i}^{2N} \frac{q^{(j)}}{\left| x_t^{(i)} - x_t^{(j)} \right|},    (4.30)

where \epsilon_w denotes the dielectric constant of water.

Ion-wall interaction potential: The ion-wall potential \Phi^{IW}, also called the (σ/r)^9 potential, ensures that the positions x_t^{(i)} of all ions i = 1, ..., 2N lie in R. With x_t^{(i)} = (x_t^{(i)}, y_t^{(i)}, z_t^{(i)})′, it is modeled as

    \Phi^{IW}(x_t^{(i)}) = \frac{F_0}{9} \, \frac{(r^{(i)} + r_w)^9}{\left( r_c + r_w - \sqrt{(x_t^{(i)})^2 + (y_t^{(i)})^2} \right)^9},    (4.31)

where for positive ions r^{(i)} = r^+ (the radius of the Na+ ion) and for negative ions r^{(i)} = r^- (the radius of the Cl− ion); r_w = 1.4 Å is the radius of the atoms making up the wall; r_c denotes the radius of the ion channel; and F_0 = 2 × 10^-10 N, which is estimated from the ST2 water model used in molecular dynamics (Stillinger and Rahman, 1974). This ion-wall potential results in short-range forces that are significant only when the ion is close to the wall of the reservoirs R1 and R2, or anywhere in the ion channel C (since the ion channel is comparable in radius to the ions).

Short-range potential: Finally, at short range, the Coulomb interaction between two ions is modified by adding a potential \Phi^{SR,i}(X_t), which replicates the effects of the overlap of electron clouds. Thus,

    \Phi^{SR,i}(X_t) = \frac{F_0}{9} \sum_{j=1,\, j \ne i}^{2N} \frac{(r^{(i)} + r^{(j)})^9}{\left| x_t^{(i)} - x_t^{(j)} \right|^9}.    (4.32)

Sensor Adaptive Signal Processing of Biological Nanotubes

Similar to the ion-wall potential, Φ_{SR,i} is significant only when ion i gets very close to another ion. It ensures that two opposite-charge ions attracted by the inter-ion Coulomb force (eq. 4.30) cannot collide and annihilate each other. Molecular dynamics simulations show that the hydration forces between two ions add further structure to the 1/|x_t^{(i)} − x_t^{(j)}|^9 repulsive potential due to the overlap of electron clouds, in the form of damped oscillations (Guàrdia et al., 1991a,b). Corry et al. (2001) incorporated the effect of the hydration forces in equation 4.32 in such a way that the maxima of the radial distribution functions for Na+-Na+, Na+-Cl−, and Cl−-Cl− correspond to the values obtained experimentally.
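As a concrete illustration, the pairwise forces implied by equations 4.30 and 4.32 can be sketched as follows. This is a toy evaluation, not the chapter's simulation code: the function names, the relative permittivity of water, and the ion radii used below are assumptions.

```python
import numpy as np

# Illustrative sketch of the pairwise forces implied by eqs. 4.30 and 4.32.
# Parameter values and function names are placeholders, not the chapter's code.
EPS0 = 8.854e-12     # vacuum permittivity (F/m)
EPS_W = 80.0         # assumed relative permittivity of water
F0 = 2e-10           # N, from the ST2 water model

def coulomb_force(i, x, q):
    """Force on ion i from the inter-ion Coulomb interaction (eq. 4.30)."""
    f = np.zeros(3)
    for j in range(len(q)):
        if j != i:
            d = x[i] - x[j]
            r = np.linalg.norm(d)
            # -grad_i of q_i q_j / (4 pi eps0 eps_w r) is q_i q_j d / (4 pi eps r^3)
            f += q[i] * q[j] * d / (4.0 * np.pi * EPS0 * EPS_W * r**3)
    return f

def short_range_force(i, x, radii):
    """Force on ion i from the 1/r^9 electron-cloud overlap potential (eq. 4.32)."""
    f = np.zeros(3)
    for j in range(len(radii)):
        if j != i:
            d = x[i] - x[j]
            r = np.linalg.norm(d)
            # Phi = (F0/9)(r_i + r_j)^9 / r^9, so the force is F0 (r_i + r_j)^9 d / r^11
            f += F0 * (radii[i] + radii[j])**9 * d / r**11
    return f
```

The forces on a pair of ions come out equal and opposite, and the 1/r^11 force is negligible except at contact distances, matching the observation that Φ_{SR,i} matters only when two ions nearly touch.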

4.5  Ion Permeation: Probabilistic Characterization and Brownian Dynamics (BD) Algorithm

Having given a complete description of the dynamics of the individual ions that permeate through the ion channel, in this section we give a probabilistic characterization of the ion channel current. In particular, we show that the mean ion channel current satisfies a boundary-valued partial differential equation. We then show that the Brownian dynamics (BD) simulation algorithm can be viewed as a randomized multiparticle-based algorithm for solving this boundary-valued partial differential equation to estimate the ion channel current.

4.5.1  Probabilistic Characterization of Ion Channel Current in Terms of Mean Passage Time

The aim of this subsection is to give a probabilistic characterization of the ion channel current in terms of the mean first-passage time of the diffusion process (see equation 4.27). This characterization also shows that the Brownian dynamical system 4.27 has a well-defined, unique stationary distribution. A key requirement in any mathematical construction is that the concentration of ions in each reservoir R1 and R2 remains approximately constant and equal to the physiological concentration. The following probabilistic construction ensures that the concentrations of ions in reservoirs R1 and R2 remain approximately constant.

Step 1: The 2N ions in the system are initialized as described above, and the ion channel C is closed. The system evolves and attains stationarity. Theorem 4.2 below shows that the probability density function of the 2N particles converges geometrically fast to a unique stationary distribution. Theorem 4.3 shows that in the stationary regime, all positive ions in R1 have the same stationary distribution and so are statistically indistinguishable (similarly for R2).

Step 2: After stationarity is achieved, the ion channel is opened. The ions evolve according to equation 4.27. As soon as an ion from R1 crosses the ion channel C and enters R2, the experiment is stopped. Similarly, if an ion from R2 crosses C and enters R1, the experiment is stopped. Theorem 4.4 gives partial differential equations for the mean minimum time an ion in R1 takes to cross the ion channel and reach R2, and establishes that this time is finite. From this, a theoretical expression for the mean ion channel current is constructed (eq. 4.42).


Note that if the system were allowed to evolve for an infinite time with the channel open, then eventually, due to the external applied potential, more ions would be in R2 than in R1. This would violate the condition that the concentrations of particles in R1 and R2 remain constant. In the BD simulation algorithm 4.2 presented later in this chapter, we use the above construction to restart the simulation each time an ion crosses the channel; this leads to a regenerative process that is easy to analyze.

Let

π_t^{(θ,λ)}(X, V) = p^{(θ,λ)}(x_t^{(1)}, x_t^{(2)}, ..., x_t^{(2N)}, v_t^{(1)}, v_t^{(2)}, ..., v_t^{(2N)})

denote the joint probability density function (pdf) of the positions and velocities of all 2N ions at time t. We denote the θ, λ dependence of the pdfs explicitly, since they depend on the PMF U_θ and the applied external potential λ. Then the joint pdf of the positions of all 2N ions at time t is π_t^{(θ,λ)}(X) = p^{(θ,λ)}(x_t^{(1)}, x_t^{(2)}, ..., x_t^{(2N)}), where

π_t^{(θ,λ)}(X) = ∫_{R^{6N}} π_t^{(θ,λ)}(X, V) dV.

The following result, proved in Krishnamurthy and Chung (a), states that for the above stochastic dynamical system, π_t^{(θ,λ)}(X, V) converges exponentially fast to its stationary (invariant) distribution π_∞^{(θ,λ)}(X, V).

Theorem 4.2
For the Brownian dynamics system (4.27, 4.28), with ζ = (X, V), there exists a unique stationary distribution π_∞^{(θ,λ)}(ζ), and constants K > 0 and 0 < ρ < 1, such that

sup_{ζ ∈ R^{2N} × R^{6N}} |π_t^{(θ,λ)}(ζ) − π_∞^{(θ,λ)}(ζ)| ≤ K V(ζ) ρ^t.  (4.33)

Here V(ζ) > 1 is an arbitrary measurable function on R^{2N} × R^{6N}.

The above theorem on the exponential ergodicity of ζ_t = (X_t, V_t) has two consequences that we will use subsequently. First, it implies that as the system evolves, the initial coordinates x_0^{(i)}, v_0^{(i)} of all 2N ions are forgotten exponentially fast. This allows us to conduct BD simulations efficiently in section 4.5.2 below. Second, the exponential ergodicity also implies that a strong law of large numbers holds; this will be used below to formulate a stochastic optimization problem, in terms of the stationary measure π_∞^{(θ,λ)}, for computing the potential of mean force.

Notation is as follows: for ζ = (ζ^{(1)}, ..., ζ^{(4N)})′, define

∇_ζ = ( ∂/∂ζ^{(1)}, ∂/∂ζ^{(2)}, ..., ∂/∂ζ^{(4N)} )′.


For a vector field f_{θ,λ}(ζ) = [f^{(1)}(ζ) f^{(2)}(ζ) ⋯ f^{(4N)}(ζ)]′ defined on R^{4N}, define the divergence operator

div(f_{θ,λ}) = ∂f^{(1)}/∂ζ^{(1)} + ∂f^{(2)}/∂ζ^{(2)} + ⋯ + ∂f^{(4N)}/∂ζ^{(4N)}.

For the stochastic dynamical system 4.27, comprising 2N ions, define the backward elliptic operator (infinitesimal generator) L and its adjoint L* for any test function φ(ζ) as

L(φ) = (1/2) Tr[Σ ∇²_ζ φ(ζ)] + (f_{θ,λ}(ζ) + Aζ)′ ∇_ζ φ(ζ),
L*(φ) = (1/2) Tr[∇²_ζ (Σ φ(ζ))] − div[(Aζ + f_{θ,λ}(ζ)) φ(ζ)].  (4.34)

Here, f_{θ,λ} and Σ are defined in equation 4.28. It is well known that the probability density function π_t^{(θ,λ)}(·) of ζ_t = (X_t, V_t) satisfies the Fokker-Planck equation (Wong and Hajek, 1985):

dπ_t^{(θ,λ)}/dt = L* π_t^{(θ,λ)}.  (4.35)

Also, the stationary probability density function π_∞^{(θ,λ)}(·) satisfies

L*(π_∞^{(θ,λ)}) = 0,  ∫_{R^{6N}} ∫_{R^{2N}} π_∞^{(θ,λ)}(X, V) dX dV = 1.  (4.36)

We next show that once stationarity has been achieved, the N positive ions behave statistically identically, i.e., each ion has the same stationary marginal distribution. Define the stationary marginal density π_∞^{(θ,λ)}(x^{(i)}, v^{(i)}) of ion i as

π_∞^{(θ,λ)}(x^{(i)}, v^{(i)}) = ∫_{R^{6N−3}} ∫_{R^{2N−1}} π_∞^{(θ,λ)}(X, V) ∏_{j=1, j≠i}^{2N} dx^{(j)} dv^{(j)}.  (4.37)

The following result states that the ions are statistically indistinguishable; see the paper of Krishnamurthy and Chung (a) for the proof.

Theorem 4.3
Assuming that the ion channel C is closed, the stationary marginal densities for the positive ions in R1 are identical:

π_∞^{(θ,λ),R1} = π_∞^{(θ,λ)}(x^{(1)}, v^{(1)}) = π_∞^{(θ,λ)}(x^{(2)}, v^{(2)}) = ⋯ = π_∞^{(θ,λ)}(x^{(N/2)}, v^{(N/2)}).

Similarly, the stationary marginal densities for the positive ions in R2 are identical:

π_∞^{(θ,λ),R2} = π_∞^{(θ,λ)}(x^{(N/2+1)}, v^{(N/2+1)}) = π_∞^{(θ,λ)}(x^{(N/2+2)}, v^{(N/2+2)}) = ⋯ = π_∞^{(θ,λ)}(x^{(N)}, v^{(N)}).  (4.38)

Theorem 4.3 is not surprising: equations 4.23, 4.24, and 4.25 are symmetric in i, so intuitively one would expect that once steady state has been attained, all the positive ions behave identically, and similarly for the negative ions. Due to the above result, in our probabilistic formulation below, once the system has attained steady state, any positive ion is representative of all the N positive ions, and similarly for the negative ions.

Assume that the system 4.27 comprising 2N ions has attained stationarity with the ion channel C closed. Then the ion channel is opened so that ions can diffuse into it. Let τ_{R1,R2}^{(θ,λ)} denote the mean first-passage time for any of the N/2 Na+ ions in R1 to travel to R2 via the gramicidin-A channel C, and τ_{R2,R1}^{(θ,λ)} the mean first-passage time for any of the N/2 Na+ ions in R2 to travel to R1:

τ_{R1,R2}^{(θ,λ)} = E{t_β}, where t_β = inf{ t : max(z_t^{(1)}, z_t^{(2)}, ..., z_t^{(N/2)}) ≥ β },
τ_{R2,R1}^{(θ,λ)} = E{t_α}, where t_α = inf{ t : min(z_t^{(N/2+1)}, z_t^{(N/2+2)}, ..., z_t^{(N)}) ≤ α }.  (4.39)

Note that for ion channels such as gramicidin-A, only positive Na+ ions flow through the channel to cause the channel current, so we do not need to consider the mean first-passage times of the Cl− ions. In order to give a partial differential equation for τ_{R1,R2}^{(θ,λ)} and τ_{R2,R1}^{(θ,λ)}, it is convenient to define the closed sets

P2 = { ζ : {z^{(1)} ≥ β} ∪ {z^{(2)} ≥ β} ∪ ⋯ ∪ {z^{(N/2)} ≥ β} },
P1 = { ζ : {z^{(N/2+1)} ≤ α} ∪ {z^{(N/2+2)} ≤ α} ∪ ⋯ ∪ {z^{(N)} ≤ α} }.  (4.40)

Then it is clear that ζ_t ∈ P2 is equivalent to max(z_t^{(1)}, z_t^{(2)}, ..., z_t^{(N/2)}) ≥ β, since either expression implies that at least one ion has crossed from R1 to R2. Similarly, ζ_t ∈ P1 is equivalent to min(z_t^{(N/2+1)}, z_t^{(N/2+2)}, ..., z_t^{(N)}) ≤ α. Thus t_β and t_α defined in system 4.39 can be expressed as t_β = inf{t : ζ_t ∈ P2}, t_α = inf{t : ζ_t ∈ P1}. Hence system 4.39 is equivalent to

τ_{R1,R2}^{(θ,λ)} = E{inf{t : ζ_t ∈ P2}},  τ_{R2,R1}^{(θ,λ)} = E{inf{t : ζ_t ∈ P1}}.  (4.41)

In a gramicidin-A channel, τ_{R2,R1}^{(θ,λ)} is typically much larger than τ_{R1,R2}^{(θ,λ)}. In terms of the mean first-passage times τ_{R1,R2}^{(θ,λ)} and τ_{R2,R1}^{(θ,λ)} defined in equations 4.39 and 4.41, the mean current flowing from R1 via the gramicidin-A ion channel C into R2 is defined as

I^{(θ,λ)} = q^+ ( 1/τ_{R1,R2}^{(θ,λ)} − 1/τ_{R2,R1}^{(θ,λ)} ).  (4.42)

The following result, adapted from Gihman and Skorohod (1972, p. 306), shows that τ_{R1,R2}^{(θ,λ)} and τ_{R2,R1}^{(θ,λ)} satisfy a boundary-valued partial differential equation.


Theorem 4.4
The mean first-passage times τ_{R1,R2}^{(θ,λ)} and τ_{R2,R1}^{(θ,λ)} in 4.42 are obtained as

τ_{R1,R2}^{(θ,λ)} = ∫_{Ξ_{R1}} τ_{R1,R2}^{(θ,λ)}(ζ) π_∞^{(θ,λ)}(ζ) dζ,  (4.43)
τ_{R2,R1}^{(θ,λ)} = ∫_{Ξ_{R2}} τ_{R2,R1}^{(θ,λ)}(ζ) π_∞^{(θ,λ)}(ζ) dζ,  (4.44)

where

τ_{R1,R2}^{(θ,λ)}(ζ) = E{inf{t : ζ_t ∈ P2} | ζ_0 = ζ},  (4.45)
τ_{R2,R1}^{(θ,λ)}(ζ) = E{inf{t : ζ_t ∈ P1} | ζ_0 = ζ}.  (4.46)

Here τ_{R1,R2}^{(θ,λ)}(ζ) and τ_{R2,R1}^{(θ,λ)}(ζ) satisfy the following boundary-valued partial differential equations:

L(τ_{R1,R2}^{(θ,λ)}(ζ)) = −1 for ζ ∉ P2,  τ_{R1,R2}^{(θ,λ)}(ζ) = 0 for ζ ∈ P2,
L(τ_{R2,R1}^{(θ,λ)}(ζ)) = −1 for ζ ∉ P1,  τ_{R2,R1}^{(θ,λ)}(ζ) = 0 for ζ ∈ P1,  (4.47)

where L denotes the backward operator defined in equation 4.34. Furthermore, τ_{R1,R2}^{(θ,λ)} and τ_{R2,R1}^{(θ,λ)} are finite.

The proof of equations 4.47 follows directly from corollary 1, p. 306, of Gihman and Skorohod (1972), which shows that the mean first-passage time from any point ζ to a closed set P2 satisfies equations 4.47. The proof that τ_{R1,R2}^{(θ,λ)} and τ_{R2,R1}^{(θ,λ)} are finite follows directly from p. 145 of Friedman (1975).

Remark: Equation 4.42 specifies the mean current as the charge per mean time it takes for an ion to cross the ion channel. Instead of equation 4.42, an alternative definition of the mean current is the expected rate of charge transfer across the ion channel, i.e.,

Ĩ(θ, λ) = q^+ ( μ_{R1,R2}^{(θ,λ)} − μ_{R2,R1}^{(θ,λ)} ),  (4.48)

where, with t_α and t_β defined in equation 4.39, the mean rates μ_{R1,R2}^{(θ,λ)} and μ_{R2,R1}^{(θ,λ)} are defined as

μ_{R1,R2}^{(θ,λ)} = E{1/t_β},  μ_{R2,R1}^{(θ,λ)} = E{1/t_α}.  (4.49)

It is important to note that the two definitions of current, namely I^{(θ,λ)} in 4.42 and Ĩ(θ, λ) in 4.48, are not equivalent, since E{1/t_β} ≠ 1/E{t_β}. Similar to the proof of theorem 4.4, partial differential equations can be obtained for μ_{R1,R2}^{(θ,λ)} and μ_{R2,R1}^{(θ,λ)}; however, the resulting boundary conditions are much more complex than equations 4.47.
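The boundary-value problem 4.47 has no closed form in the full state space, but its one-dimensional analogue can be solved numerically and checked against a classical formula. The sketch below is a toy under stated assumptions: pure diffusion dz = σ dw on [0, ℓ] with absorbing ends, for which L = (σ²/2) d²/dz², L τ = −1, and the exact mean first-passage time is τ(z) = z(ℓ − z)/σ².

```python
import numpy as np

# Finite-difference solve of the 1D analogue of the boundary-valued problem
# 4.47: pure diffusion dz = sigma dw on [0, ell] with absorbing ends, so that
# L = (sigma^2/2) d^2/dz^2, L tau = -1, tau = 0 on the boundary.  The exact
# answer for this toy is tau(z) = z (ell - z) / sigma^2.
sigma, ell, m = 1.0, 1.0, 200
h = ell / m
z = np.linspace(0.0, ell, m + 1)

c = 0.5 * sigma**2 / h**2
A = np.zeros((m - 1, m - 1))     # tridiagonal discretization of L
for i in range(m - 1):
    A[i, i] = -2.0 * c
    if i > 0:
        A[i, i - 1] = c
    if i < m - 2:
        A[i, i + 1] = c

tau = np.zeros(m + 1)            # absorbing boundaries: tau = 0 at both ends
tau[1:m] = np.linalg.solve(A, np.full(m - 1, -1.0))

exact = z * (ell - z) / sigma**2
print(np.max(np.abs(tau - exact)))   # near machine-precision agreement
```

Because the exact solution is quadratic, the central difference recovers it essentially to roundoff here; for the generator of system 4.27 the same discretize-and-solve approach is intractable, which is why the chapter turns to BD simulation instead.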

4.5.2  Brownian Dynamics Simulation for Estimation of Ion Channel Current

It is not possible to obtain explicit closed-form expressions for the mean first-passage times τ_{R1,R2}^{(θ,λ)} and τ_{R2,R1}^{(θ,λ)}, and hence for the current I^{(θ,λ)} in equation 4.42. The aim of BD simulation is to obtain estimates of these quantities by directly simulating the stochastic dynamical system 4.27. In this subsection we show that the current estimates Î^{(θ,λ)}(L) (defined below) obtained from an L-iteration BD simulation are statistically consistent, i.e., lim_{L→∞} Î^{(θ,λ)}(L) = I^{(θ,λ)} almost surely.

Due to the applied external potential Φ_λ^ext (see equation 4.29), ions drift from reservoir R1 via the ion channel C to the reservoir R2, thus generating an ion channel current. In order to construct an estimate of the current flowing from R1 to R2 in the BD simulation, we need to count the number of upcrossings (i.e., the number of times ions cross from R1 to R2 through the region C) and downcrossings (i.e., the number of times ions cross from R2 to R1 through the region C). Recall from fig. 4.3 that z = α = −12.5 Å denotes the boundary between R1 and C, and z = β = 12.5 Å denotes the boundary between R2 and C.

Time Discretization of Ion Dynamics  To implement the BD simulation algorithm described below on a digital computer, it is necessary to discretize the continuous-time dynamics (eq. 4.27) of the 2N ions. The BD simulation algorithm typically uses a sampling interval of Δ = 10^−15 seconds (1 femtosecond) for this time discretization, and propagates the 2N ions over a total time period of T = 10^−4 seconds. The time discretization proceeds as follows. Consider a regular partition 0 = t_0 < t_1 < ⋯ < t_{k−1} < t_k < ⋯ < T with discretization interval Δ = t_k − t_{k−1} = 10^−15 seconds. There are several possible methods for time discretization of the stochastic differential equation 4.27; see Kloeden and Platen (1992) for a detailed exposition. Here we briefly present a zero-order hold and a first-order hold approximation.
The first-order hold approximation was derived by van Gunsteren et al. (1981). It is well known (Wong and Hajek, 1985) that over the time interval [t_k, t_{k+1}), the solution of equation 4.27 satisfies

ζ_{t_{k+1}} = e^{AΔ} ζ_{t_k} + ∫_{t_k}^{t_{k+1}} e^{A(t_{k+1}−τ)} f_{θ,λ}(ζ_τ) dτ + ∫_{t_k}^{t_{k+1}} e^{A(t_{k+1}−τ)} Σ^{1/2} dw_τ.  (4.50)

In the zero-order hold approximation, f_{θ,λ}(ζ_τ) is assumed to be approximately constant over the short interval [t_k, t_{k+1}) and is set to the constant f_{θ,λ}(ζ_{t_k}) in equation 4.50. This yields

ζ_{t_{k+1}} = e^{AΔ} ζ_{t_k} + ∫_{t_k}^{t_{k+1}} e^{A(t_{k+1}−τ)} f_{θ,λ}(ζ_{t_k}) dτ + ∫_{t_k}^{t_{k+1}} e^{A(t_{k+1}−τ)} Σ^{1/2} dw_τ.  (4.51)
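For a single coordinate, the zero-order-hold recursion can be written in closed form. The sketch below is illustrative, assuming the Langevin block structure A = [[0, 1], [0, D]] with D = −γ, under which e^{AΔ} = [[1, L], [0, e^{DΔ}]] with L = D^{−1}(e^{DΔ} − 1); the resulting position/velocity update is the scalar version of the 2N-ion update derived below.

```python
import numpy as np

# Zero-order-hold recursion for a single coordinate, assuming the Langevin
# block structure A = [[0, 1], [0, D]] with D = -gamma.  gamma and dt are
# illustrative, not the gramicidin simulation's values.
gamma, dt = 2.0, 1e-3
D = -gamma
eDd = np.exp(D * dt)            # e^{D Delta}
L = (eDd - 1.0) / D             # L = D^{-1}(e^{D Delta} - 1)

def step(x, v, force, noise):
    """Scalar version of the discrete-time position/velocity update."""
    x_new = x + L * v + (L - dt) / D * force(x)
    v_new = eDd * v + L * force(x) + noise
    return x_new, v_new

# Cross-check Gamma = e^{A Delta} against the closed form [[1, L], [0, e^{D dt}]]
A = np.array([[0.0, 1.0], [0.0, D]])
G, T = np.eye(2), np.eye(2)
for n in range(1, 25):           # truncated matrix-exponential series
    T = T @ (A * dt) / n
    G = G + T
assert np.allclose(G, [[1.0, L], [0.0, eDd]], atol=1e-14)
```

The series cross-check confirms that the exact one-step transition matrix of the linear part really has the claimed [[1, L], [0, e^{DΔ}]] structure, so the only approximation in the zero-order hold is freezing the force over each sampling interval.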

In the first-order hold, the following approximation is used in equation 4.50:

f_{θ,λ}(ζ_τ) ≈ f_{θ,λ}(ζ_{t_k}) + (τ − t_k) ∂f_{θ,λ}(ζ_t)/∂t.


In van Gunsteren et al. (1981), the derivative above is approximated by

∂f_{θ,λ}(ζ_t)/∂t ≈ ( f_{θ,λ}(ζ_{t_k}) − f_{θ,λ}(ζ_{t_{k−1}}) ) / Δ.

Thus the first-order hold approximation of van Gunsteren et al. (1981) yields

ζ_{t_{k+1}} = e^{AΔ} ζ_{t_k} + ∫_{t_k}^{t_{k+1}} e^{A(t_{k+1}−τ)} f_{θ,λ}(ζ_{t_k}) dτ
  + ∫_{t_k}^{t_{k+1}} e^{A(t_{k+1}−τ)} (τ − t_k) ( f_{θ,λ}(ζ_{t_k}) − f_{θ,λ}(ζ_{t_{k−1}}) ) / Δ dτ
  + ∫_{t_k}^{t_{k+1}} e^{A(t_{k+1}−τ)} Σ^{1/2} dw_τ.  (4.52)

Let k = 0, 1, ... denote discrete time, where k corresponds to time t_k. Note that the last integral above is merely a discrete-time Gauss-Markov process, which we denote as w_k^{(d)}. Moreover, since the first block element of w in (4.26) is 0,

w_k^{(d)} = [ 0_{6N×1} ; w̄_k^{(d)} ],  (4.53)

where the 6N-dimensional vector w̄_k^{(d)} denotes the nonzero components of w_k^{(d)}.

We now elaborate on the zero-order hold model. Due to the simple structure of A in equation 4.28, the matrix exponentials

Γ = e^{AΔ},  B = ∫_{t_k}^{t_{k+1}} e^{A(t_{k+1}−τ)} dτ  (4.54)

in equation 4.50 can be explicitly computed as

Γ = e^{AΔ} = [ I_{6N×6N}  L ; 0_{6N×6N}  e^{DΔ} ],  B = [ ΔI_{6N×6N}  D^{−1}(L − ΔI) ; 0_{6N×6N}  L ],
where D = [ −γ^+ I_{3N×3N}  0_{3N×3N} ; 0_{3N×3N}  −γ^− I_{3N×3N} ],  L = D^{−1}( e^{DΔ} − I ).  (4.55)

Then the above update for ζ_{t_{k+1}} in discrete-time notation reads

ζ_{k+1}^{(d)} = Γ ζ_k^{(d)} + B f_{θ,λ}(ζ_k^{(d)}) + w_k^{(d)}.  (4.56)

Expanding this out in terms of X_k and V_k, we have the following discrete-time dynamics for the positions and velocities of the 2N ions:

X_{k+1} = X_k + L V_k + D^{−1}(L − ΔI) F_{θ,λ}(X_k),  (4.57)
V_{k+1} = e^{DΔ} V_k + L F_{θ,λ}(X_k) + w̄_k^{(d)},  (4.58)

where w̄_k^{(d)} is a 6N-dimensional discrete-time Gauss-Markov process.

Brownian Dynamics Simulation Algorithm  In the BD simulation algorithm 4.2 below, we use the following notation:
The algorithm runs for L iterations, where L is user specified. Each iteration l, l = 1, 2, ..., L, runs for a random number of discrete-time steps until an ion crosses the channel. We denote these random times as τ̂_{R1,R2}^{(l)} if the ion has crossed from R1 to R2, and τ̂_{R2,R1}^{(l)} if the ion has crossed from R2 to R1. Thus

τ̂_{R1,R2}^{(l)} = min{ k : ζ_k^{(d)} ∈ P2 },  τ̂_{R2,R1}^{(l)} = min{ k : ζ_k^{(d)} ∈ P1 }.

The positive ions {1, 2, ..., N/2} are in R1 at steady state π_∞^{(θ,λ)}, and the positive ions {N/2 + 1, ..., N} are in R2 at steady state. L_{R1,R2} is a counter for how many Na+ ions have crossed from R1 to R2, and L_{R2,R1} counts how many Na+ ions have crossed from R2 to R1. Note that L_{R1,R2} + L_{R2,R1} = L. We only consider the passage of positive Na+ ions i = 1, ..., N across the ion channel, since in a gramicidin-A channel the ion channel current is caused only by Na+ ions.

Algorithm 4.2 Brownian Dynamics Simulation Algorithm (for Fixed θ and λ)
Input parameters: θ for the PMF and λ for the applied external potential.
For l = 1 to L iterations:

Step 1: Initialize all 2N ions according to the stationary distribution π_∞^{(θ,λ)} defined in equation 4.36. Open the ion channel at discrete time k = 0 and set k = 1.

Step 2: Propagate all 2N ions according to the time-discretized Brownian dynamical system 4.56 until the time k* at which an ion crosses the channel.
  If an ion crossed the ion channel from R1 to R2, i.e., for some ion i* ∈ {1, 2, ..., N/2}, z_{k*}^{(i*)} ≥ β, then set τ̂_{R1,R2}^{(l)} = k*. Update the number of crossings from R1 to R2: L_{R1,R2} = L_{R1,R2} + 1.
  If an ion crossed the ion channel from R2 to R1, i.e., for some ion i* ∈ {N/2 + 1, ..., N}, z_{k*}^{(i*)} ≤ α, then set τ̂_{R2,R1}^{(l)} = k*. Update the number of crossings from R2 to R1: L_{R2,R1} = L_{R2,R1} + 1.

Step 3: End for loop. Compute the mean first-passage times and the mean current estimate after L iterations as

τ̂_{R1,R2}^{(θ,λ)}(L) = (1/L_{R1,R2}) Σ_{l=1}^{L_{R1,R2}} τ̂_{R1,R2}^{(l)},  τ̂_{R2,R1}^{(θ,λ)}(L) = (1/L_{R2,R1}) Σ_{l=1}^{L_{R2,R1}} τ̂_{R2,R1}^{(l)},  (4.59)

Î^{(θ,λ)}(L) = q^+ ( 1/τ̂_{R1,R2}^{(θ,λ)}(L) − 1/τ̂_{R2,R1}^{(θ,λ)}(L) ).  (4.60)

The following result shows that the estimated current Iˆ(θ,λ) (L) obtained from a BD simulation run over L iterations is strongly consistent.


Theorem 4.5
For fixed PMF θ ∈ Θ and applied external potential λ ∈ Λ, the ion channel current estimate Î^{(θ,λ)}(L) obtained from the BD simulation algorithm 4.2 over L iterations is strongly consistent, i.e.,

lim_{L→∞} Î^{(θ,λ)}(L) = I^{(θ,λ)}  w.p.1,  (4.61)

where I^{(θ,λ)} is the mean current defined in equation 4.42.

Proof  Since, by construction, each of the L iterations in algorithm 4.2 is statistically independent, and E{τ̂_{R1,R2}^{(l)}} and E{τ̂_{R2,R1}^{(l)}} are finite (see theorem 4.4), by Kolmogorov's strong law of large numbers

lim_{L→∞} τ̂_{R1,R2}^{(θ,λ)}(L) = τ_{R1,R2}^{(θ,λ)},  lim_{L→∞} τ̂_{R2,R1}^{(θ,λ)}(L) = τ_{R2,R1}^{(θ,λ)}  w.p.1.

Thus q^+ ( 1/τ̂_{R1,R2}^{(θ,λ)}(L) − 1/τ̂_{R2,R1}^{(θ,λ)}(L) ) → I^{(θ,λ)} w.p.1 as L → ∞.
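The regenerative loop and the consistency argument of theorem 4.5 can be exercised on a deliberately small toy in which a single biased random walker stands in for the 2N-ion system, and hitting +M or −M stands in for an R1→R2 or R2→R1 crossing. The barrier M, the bias p, and the unit charge are all assumptions of the sketch.

```python
import random

# Toy regenerative loop in the spirit of algorithm 4.2: a single biased walker
# replaces the 2N-ion system; hitting +M / -M stands in for an R1->R2 / R2->R1
# crossing.  Barrier M, bias p, and the unit charge q_plus are illustrative.
random.seed(0)
M, p, L_total = 8, 0.6, 2000
q_plus = 1.0

up_times, down_times = [], []
for _ in range(L_total):          # each iteration restarts "at stationarity"
    z, k = 0, 0
    while abs(z) < M:             # propagate until a boundary is crossed
        z += 1 if random.random() < p else -1
        k += 1
    (up_times if z >= M else down_times).append(k)

tau_up = sum(up_times) / len(up_times)             # sample means, as in eq. 4.59
tau_down = sum(down_times) / len(down_times)
I_hat = q_plus * (1.0 / tau_up - 1.0 / tau_down)   # current estimate, as in eq. 4.60
print(len(up_times), len(down_times), I_hat)
```

With an upward bias most iterations end in an upcrossing, and as L grows the sample averages settle by the strong law of large numbers, exactly as in the proof above.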

Remark: If, instead of equation 4.42, the mean rate definition in equation 4.48 is used for the mean current Ĩ(θ, λ), then the following minor modification of algorithm 4.2 yields consistent estimates of Ĩ(θ, λ). Instead of equations 4.59 and 4.60, use

μ̂_{R1,R2}^{(θ,λ)}(L) = (1/L_{R1,R2}) Σ_{l=1}^{L_{R1,R2}} 1/τ̂_{R1,R2}^{(l)},  μ̂_{R2,R1}^{(θ,λ)}(L) = (1/L_{R2,R1}) Σ_{l=1}^{L_{R2,R1}} 1/τ̂_{R2,R1}^{(l)},  (4.62)

Î^{(θ,λ)}(L) = q^+ ( μ̂_{R1,R2}^{(θ,λ)}(L) − μ̂_{R2,R1}^{(θ,λ)}(L) ).  (4.63)

Then a virtually identical proof to theorem 4.5 yields that Î^{(θ,λ)}(L) → Ĩ(θ, λ) w.p.1 as L → ∞.

Implementation Details and Variations of Algorithm 4.2  In algorithm 4.2, the procedure of resetting all ions to π_∞^{(θ,λ)} in step 1 when any ion crosses the channel can be expressed mathematically as

ζ_{k+1}^{(d)} = 1{ζ_k^{(d)} ∉ P2 ∪ P1} [ Γ ζ_k^{(d)} + B f_{θ,λ}(ζ_k^{(d)}) + w_k^{(d)} ] + 1{ζ_k^{(d)} ∈ P2 ∪ P1} ζ_0^{(d)},  ζ_0^{(d)} ∼ π_∞^{(θ,λ)},  (4.64)

where P1 and P2 are defined in equation 4.40. The following approximations of algorithm 4.2 can be used in actual numerical simulations.

Instead of steps 2a and 2b, only remove the crossed ion, denoted i*, and put it back in its reservoir with probability distribution π_∞^{(θ,λ),R1} or π_∞^{(θ,λ),R2} (eqs. 4.37 and 4.38), depending on whether it originated from R1 or R2. The other particles are not reset. With ζ_k^{(d),i*} denoting the position and velocity of the crossed ion, ζ̄_k^{(d),i*} denoting the positions and velocities of the remaining 2N − 1 ions, and ζ̃_k^{(d)} = (ζ̄_k^{(d),i*}, ζ_k^{(d),i*}) the composite state with the crossed ion redrawn, this is mathematically equivalent to replacing equation 4.64 by

ζ_{k+1}^{(d)} = 1{ζ_k^{(d)} ∉ P2 ∪ P1} [ Γ ζ_k^{(d)} + B f_{θ,λ}(ζ_k^{(d)}) + w_k^{(d)} ] + 1{ζ_k^{(d)} ∈ P2 ∪ P1} [ Γ ζ̃_k^{(d)} + B f_{θ,λ}(ζ̃_k^{(d)}) + w_k^{(d)} ],  (4.65)

where ζ_k^{(d),i*} ∼ π_∞^{(θ,λ),R1} if i* ∈ {1, ..., N/2}, and ζ_k^{(d),i*} ∼ π_∞^{(θ,λ),R2} if i* ∈ {N/2 + 1, ..., N}.

Alternatively, proceed as in the above approximation (eq. 4.65), except that ζ_k^{(d),i*} is replaced according to a uniform distribution.

The above approximations are justified for three reasons:

1. Only one ion can be inside the gramicidin-A channel C at any time instant. When this happens, the ion channel behaves as though it is closed. Then the probabilistic construction of step 1 in section 4.5.1 applies.

2. The probability density functions of the remaining 2N − 1 ions converge rapidly to their stationary distribution and forget their initial distribution exponentially fast, due to the exponential ergodicity of theorem 4.2. In comparison, the time taken for an ion to cross the channel is significantly larger. As a result, the removal of crossed particles and their replacement in the reservoir happens extremely infrequently, and between such events the probability density functions of the ions rapidly converge to their stationary distribution.

3. If an ion enters the channel C, then the change in the concentration of ions in the reservoir is of magnitude 1/N. This is negligible if N is chosen sufficiently large.

4.6  Adaptive Brownian Dynamics Mesoscopic Simulation of Ion Channel

Having given a complete description of the dynamics of individual ions in section 4.4 and the Brownian dynamics algorithm for estimating the ion channel current, in this section we describe how the Brownian dynamics algorithm can be adaptively controlled to determine the molecular structure of the ion channel. We estimate the PMF profile U_θ, parameterized by θ, by computing the θ that optimizes the fit between the mean current I^{(θ,λ)} (defined above in eq. 4.42) and the experimentally observed current y(λ) defined below. Unfortunately, it is impossible to compute I^{(θ,λ)} explicitly from equation 4.42. For this reason we resort to the stochastic optimization problem formulated below, in which consistent estimates of I^{(θ,λ)} are obtained via the Brownian dynamics simulation algorithm 4.2. The main algorithm presented in this section is the adaptive Brownian dynamics simulation algorithm (algorithm 4.3), which solves the stochastic optimization problem and yields the optimal PMF. In Krishnamurthy and Chung (b), we showed how to obtain the effective surface charge density along the inside surface of the ion channel protein from the PMF.

4.6.1  Formulation of PMF Estimation as a Stochastic Optimization Problem

The stochastic optimization problem formulation for determining the optimal PMF estimate comprises the following four ingredients.

Experimentally Observed Ion Channel Current y(λ)  Neurobiologists use the patch-clamp experimental setup to obtain experimental measurements of the current flowing through a single ion channel. Typically, the measured discrete-time (digitized) current from a patch-clamp experiment is obtained by sampling the continuous-time observed current at 10 kHz (i.e., at 0.1-millisecond intervals). Note that this is at a much slower timescale than the dynamics of the individual ions, which move on a femtosecond timescale. Patch clamping was widely regarded as a breakthrough in the 1970s for understanding the dynamics of ion channels at a millisecond timescale. From patch-clamp experimental data, neurobiologists can obtain an accurate measurement of the actual current y(λ) flowing through a gramicidin-A ion channel for various external applied potentials λ ∈ Λ. For example, as shown in Chung et al. (1991), the resulting discrete time series can be modeled as a hidden Markov model (HMM). Then, by using an HMM maximum likelihood estimator (Chung et al., 1991; James et al., 1996), accurate estimates of the open current level y(λ) of the ion channel can be computed. Neurobiologists typically plot the relationship between the experimentally determined current y(λ) and the applied voltage λ as an IV curve; such curves provide a unique signature for an ion channel. For our purposes, y(λ) denotes the true (real-world) channel current.

Loss Function  Let n = 1, 2, ... denote the batch number. For fixed applied field λ ∈ Λ, consider running the BD simulation algorithm 4.2 at batch n, resulting in the simulated current I_n^{(θ,λ)}. Define the sample loss Q_n(θ, λ) = |I_n^{(θ,λ)} − y(λ)|² and the mean square error loss function

Q(θ, λ) = E{ |I_n^{(θ,λ)} − y(λ)|² }.  (4.66)

Define the total loss function by adding the mean square error over all the applied fields λ ∈ Λ on the IV curve:

Q(θ) = Σ_{λ∈Λ} Q(θ, λ).  (4.67)

The optimal PMF U_{θ*} is determined by the parameter θ* that best fits the mean current I^{(θ,λ)} to the experimentally determined IV curve of a gramicidin-A channel, i.e.,

θ* = arg min_{θ∈Θ} Q(θ).  (4.68)

[Figure 4.5 block diagram: θ_n → Brownian dynamics simulation → Î_n^{(θ,λ)} → loss function evaluation Q_n(θ_n, λ) → gradient estimator ∇̂_θ Q_n(θ_n, λ) → stochastic gradient algorithm → θ_{n+1}.]

Figure 4.5  Adaptive Brownian dynamics simulation for estimating PMF.

Let Θ∗ denote the set of local minima whose elements θ∗ satisfy the second-order suﬃcient conditions for being a local minimum: θ Q(θ) = 0, ∇

2 Q(θ) > 0, ∇ θ

(4.69)

where the notation ∇2 Q(θ) > 0 means that it is a positive deﬁnite matrix. However, the deterministic optimization (eqs. 4.66,4.68) cannot be directly carried out since it is not possible to obtain explicit closed-form expressions for 4.66—this is because the partial diﬀerential equation 4.47 for the mean ﬁrst passage (θ,λ) (θ,λ) times τR2 ,R1 and τR2 ,R1 cannot be solved explicitly. This motivates us to formulate the estimation of the PMF as a stochastic optimization problem. 4.6.2

Stochastic Gradient Algorithms for Estimating Potential of Mean Force (PMF) and the Need for Gradient Estimation

We now give a complete description of the adaptive Brownian dynamics simulation algorithm for computing the optimal PMF estimate U_{θ*}. The algorithm is depicted schematically in fig. 4.5. Recall that n = 0, 1, ... denotes the batch number.

Algorithm 4.3 Adaptive Brownian Dynamics Simulation Algorithm for Estimating PMF

Step 0: Set batch index n = 0, and initialize θ_0 ∈ Θ.

Step 1 (Evaluation of loss function): At batch n, evaluate the loss function Q_n(θ_n, λ) for each external potential λ ∈ Λ according to equation 4.66. This uses one independent BD simulation (algorithm 4.2) for each λ.

Step 2 (Gradient estimation): Compute the gradient estimate ∇̂_θ Q_n(θ_n, λ), either as a finite difference (see eq. 4.72 below) or according to the SPSA algorithm (eq. 4.73 below).


Step 3 (Stochastic approximation algorithm): Update the PMF estimate:

θ_{n+1} = θ_n − ε_{n+1} Σ_{λ∈Λ} ∇̂_θ Q_n(θ_n, λ),  (4.70)

where ε_n denotes a decreasing step size (see the discussion below for the choice of step size).

Step 4: Set n to n + 1 and go to step 1.

A crucial aspect of the above algorithm is the gradient estimation step 2. In this step, an estimate ∇̂_θ Q_n(θ, λ) of the gradient ∇_θ Q_n(θ, λ) is computed. This gradient estimate is then fed to the stochastic gradient algorithm (step 3), which updates the PMF. Note that since the explicit dependence of Q_n(θ, λ) on θ is not known, it is not possible to compute ∇_θ Q_n(θ, λ) directly. Thus we have to resort to gradient estimation, e.g., the finite difference estimators described below or a more sophisticated method such as IPA (infinitesimal perturbation analysis). The step size ε_n is typically chosen as

ε_n = ε / (n + 1 + R)^κ,  (4.71)

where 0.5 < κ ≤ 1 and R is some positive constant. Note that this choice of step size automatically satisfies the condition Σ_{n=1}^∞ ε_n = ∞, which is required for convergence of algorithm 4.3.

Kiefer-Wolfowitz Finite Difference Gradient Estimator  An obvious gradient estimator is obtained by finite differences as follows. Suppose θ is a p-dimensional vector. Let e_1, e_2, ..., e_p denote p-dimensional unit vectors, where e_i has a 1 in the ith position and zeros elsewhere. Then the two-sided finite difference gradient estimator is

∇̂_θ Q_n(θ, λ) = [ ( Q_n(θ_n + μ_n e_1, λ) − Q_n(θ_n − μ_n e_1, λ) ) / (2μ_n) ;
                  ( Q_n(θ_n + μ_n e_2, λ) − Q_n(θ_n − μ_n e_2, λ) ) / (2μ_n) ;
                  ⋮ ;
                  ( Q_n(θ_n + μ_n e_p, λ) − Q_n(θ_n − μ_n e_p, λ) ) / (2μ_n) ].  (4.72)

Using equation 4.72 in algorithm 4.3 yields the so-called finite difference stochastic gradient algorithm. In the above gradient estimator, μ_n = μ/(n + 1)^γ, where typically γ < κ (with κ defined in eq. 4.71), e.g., γ = 0.101 and κ = 0.602. The main disadvantages of the above finite difference gradient estimator are twofold. First, the bias of the gradient estimate is O(μ_n²), i.e.,

E{ ∇̂_θ Q_n(θ, λ) | θ_1, ..., θ_n } − ∇_θ Q(θ, λ) = O(μ_n²).

Second, the simulation cost of implementing the estimator is large: it requires 2p BD simulations, since one BD simulation is required to evaluate Q_n(θ_n + μ_n e_i, λ) and one to evaluate Q_n(θ_n − μ_n e_i, λ), for each i = 1, 2, ..., p.
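On a toy loss with a known minimum, the Kiefer-Wolfowitz estimator 4.72 inside the update 4.70 with step sizes 4.71 can be sketched as follows. Q_noisy stands in for one noisy BD-based evaluation of Q_n(θ, λ); the minimum location and all constants are assumptions of the sketch.

```python
import numpy as np

# Kiefer-Wolfowitz finite-difference gradient estimator (eq. 4.72) driving the
# update 4.70 with step sizes 4.71 on a toy loss.  Q_noisy stands in for one
# noisy BD-based evaluation of Q_n(theta, lambda); all constants are assumed.
rng = np.random.default_rng(1)

def Q_noisy(theta):
    # hypothetical loss with minimum at theta* = (1, -2)
    return (theta[0] - 1.0)**2 + (theta[1] + 2.0)**2 + 0.01 * rng.normal()

def fd_gradient(Q, theta, mu):
    """Two-sided finite differences: 2p loss evaluations for p parameters."""
    p = len(theta)
    g = np.zeros(p)
    for i in range(p):
        e = np.zeros(p)
        e[i] = mu
        g[i] = (Q(theta + e) - Q(theta - e)) / (2.0 * mu)
    return g

theta = np.zeros(2)
for n in range(500):
    eps_n = 0.5 / (n + 1 + 10.0)**0.602   # step size, eq. 4.71
    mu_n = 0.1 / (n + 1)**0.101           # difference interval, gamma < kappa
    theta = theta - eps_n * fd_gradient(Q_noisy, theta, mu_n)
print(theta)    # approaches (1, -2)
```

Each iteration here costs 2p = 4 loss evaluations; in the BD setting that means 2p full simulations per applied potential, which is exactly the cost SPSA avoids.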


Simultaneous Perturbation Stochastic Approximation (SPSA) Algorithm  Unlike the Kiefer-Wolfowitz algorithm, the SPSA algorithm (Spall, 2003) picks a single random direction d_n along which the derivative is evaluated at each batch n. The main advantage of SPSA over the finite difference method is that evaluating the gradient estimate ∇̂_θ Q_n(θ, λ) in SPSA requires only two BD simulations, i.e., the number of evaluations is independent of the dimension p of the parameter vector θ. We refer the reader to Spall (2003) and to the Web site www.jhuapl.edu/SPSA/ for details, variations, and applications of the SPSA algorithm. The SPSA algorithm proceeds as follows. Generate the p-dimensional vector d_n with random elements d_n(i), i = 1, ..., p, simulated as

d_n(i) = −1 with probability 0.5,  +1 with probability 0.5.

Then the SPSA algorithm uses the following gradient estimator together with the stochastic gradient algorithm (eq. 4.70):

∇̂_θ Q_n(θ, λ) = [ ( Q_n(θ_n + μ_n d_n, λ) − Q_n(θ_n − μ_n d_n, λ) ) / (2μ_n d_n(1)) ;
                  ( Q_n(θ_n + μ_n d_n, λ) − Q_n(θ_n − μ_n d_n, λ) ) / (2μ_n d_n(2)) ;
                  ⋮ ;
                  ( Q_n(θ_n + μ_n d_n, λ) − Q_n(θ_n − μ_n d_n, λ) ) / (2μ_n d_n(p)) ].  (4.73)

Here μ_n is chosen as in the Kiefer-Wolfowitz algorithm. Despite the substantial computational efficiency of SPSA compared to Kiefer-Wolfowitz, the asymptotic efficiency of SPSA is identical to that of the Kiefer-Wolfowitz algorithm. Thus SPSA can be viewed as a novel application of randomization in gradient estimation to break the curse of dimensionality. It can be proved that, like the finite difference scheme, SPSA also has a bias of O(μ_n²) (Spall, 2003).

Remarks: In the SPSA algorithm above, the elements of d_n were chosen according to a Bernoulli distribution.
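A minimal SPSA iteration in the same spirit, again with a hypothetical quadratic loss standing in for the BD-based loss; note that each iteration costs two loss evaluations regardless of the dimension p.

```python
import numpy as np

# SPSA gradient estimator (eq. 4.73): a single random +/-1 direction d_n and
# only two loss evaluations per iteration, independent of the dimension p.
# The quadratic loss and theta_star are hypothetical stand-ins.
rng = np.random.default_rng(2)
theta_star = np.array([3.0, 0.0, -1.0])

def Q_noisy(theta):
    return float(np.sum((theta - theta_star)**2) + 0.01 * rng.normal())

theta = np.zeros(3)
for n in range(800):
    eps_n = 0.5 / (n + 1 + 10.0)**0.602
    mu_n = 0.1 / (n + 1)**0.101
    d = rng.choice([-1.0, 1.0], size=3)          # Bernoulli +/-1 components
    diff = Q_noisy(theta + mu_n * d) - Q_noisy(theta - mu_n * d)
    theta = theta - eps_n * diff / (2.0 * mu_n * d)   # eq. 4.73, elementwise
print(theta)    # approaches theta_star
```

Dividing the common difference by each component d_n(i) is what makes every coordinate of the gradient estimate unbiased to first order, even though only one random direction was probed.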
In general, it is possible to generate the elements of d_n according to other distributions, as long as these distributions are symmetric, zero-mean, and have bounded inverse moments; see Spall (2003) for a complete exposition of SPSA.

Convergence of Adaptive Brownian Dynamics Simulation Algorithm 4.3

Here we show that the estimates θ_n generated by algorithm 4.3 (whether using the Kiefer-Wolfowitz or SPSA gradient estimator) converge to a local minimum of the loss function.

Theorem 4.6 For batch size L → ∞ in algorithm 4.2, the sequence of estimates {θ_n} generated by the controlled Brownian dynamics simulation algorithm 4.3 converges as n → ∞ to the locally optimal PMF estimate θ* (defined in eq. 4.68) with probability 1.


Outline of Proof: Since, by construction of the BD algorithm, for fixed θ the Q_n(θ, λ) are independent and identically distributed random variables, the proof of the above theorem involves showing strong convergence of a stochastic gradient algorithm with i.i.d. observations, which is quite straightforward. In Kushner and Yin (1997), almost sure convergence of stochastic gradient algorithms for state-dependent Markovian noise under general conditions is presented. For the independent and identically distributed case we only need to verify the following condition for convergence. Condition A.4.11 in section 8.4 of Kushner and Yin (1997) requires uniform integrability of \(\widehat{\nabla}_\theta Q_n(\theta_n, \lambda)\) in equation 4.70. This holds since the discretized version of the passage time satisfies \(\hat{\tau}^{(\theta,\lambda)}_{R_1,R_2}(L) \ge 1\), implying that the estimate \(\hat{I}^{(\theta,\lambda)}_n\) from equation 4.60 in algorithm 4.2 is uniformly bounded. Thus the evaluated loss Q_n(θ, λ) (eq. 4.66) is uniformly bounded. This in turn implies that the finite difference estimate (eq. 4.72) of the Kiefer-Wolfowitz algorithm and the estimate of equation 4.73 of the SPSA algorithm are uniformly bounded, which implies uniform integrability. Then theorem 4.3 of Kushner and Yin (1997) implies that the sequence {θ_n} generated by the above controlled BD simulation algorithm 4.3 converges with probability 1 to the fixed points of the following ordinary differential equation:

\[
\frac{d\theta}{dt} = -\mathbb{E}\left[\widehat{\nabla}_\theta Q_n(\theta, \lambda)\right] = -\nabla_\theta Q(\theta), \qquad \lambda \in \Lambda,
\]

namely the set Θ∗ of local minima that satisfy the second-order suﬃcient conditions in equation 4.69.
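Both gradient estimators drop into the same stochastic gradient loop. A minimal SPSA sketch in the same spirit, again with a generic loss standing in for a BD batch and with illustrative names and gain values:

```python
import numpy as np

def spsa_gradient(loss, theta, mu, rng):
    """SPSA gradient estimate (eq. 4.73): a single random Bernoulli
    direction d with d(i) = +/-1, and only two loss evaluations in total,
    independent of the dimension p."""
    d = rng.choice([-1.0, 1.0], size=len(theta))
    diff = loss(theta + mu * d) - loss(theta - mu * d)
    return diff / (2.0 * mu * d)  # elementwise; 1/d(i) = d(i) for +/-1 entries

def spsa(loss, theta0, n_iter=2000, eps=0.5, R=10.0,
         kappa=0.602, mu0=0.1, gamma=0.101, seed=0):
    """Stochastic gradient descent driven by the SPSA estimator, with the
    same decreasing gain sequences as the Kiefer-Wolfowitz scheme."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for n in range(n_iter):
        eps_n = eps / (n + 1 + R) ** kappa
        mu_n = mu0 / (n + 1) ** gamma
        theta = theta - eps_n * spsa_gradient(loss, theta, mu_n, rng)
    return theta
```

The cross-coordinate terms introduced by the random direction have zero mean, so on a smooth loss the iterates still converge to a local minimizer, at one third the per-iteration cost of the finite difference scheme already for p = 3.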

4.7

Conclusions

This chapter has presented novel signal-processing and stochastic optimization (control) algorithms for two significant problems in ion channels, namely the gating problem and the permeation problem. For the gating problem we presented novel discrete stochastic optimization algorithms and also a multiarmed bandit formulation for activating ion channels on a biological chip. For the permeation problem, we presented an adaptive controlled Brownian dynamics simulation algorithm for estimating the structure of the ion channel. We refer the reader to Krishnamurthy and Chung (a,b) for further details of the adaptive Brownian dynamics algorithm and convergence proofs. The underlying theme of this chapter is the idea of sensor adaptive signal processing, which transcends standard statistical signal processing (which deals with extracting signals from noisy measurements) to address the deeper issue of how to dynamically minimize sensor costs while simultaneously extracting a signal from noisy measurements. As we have seen in this chapter, the resulting problem is a dynamic stochastic optimization problem, whereas all traditional statistical signal-processing problems (such as optimal and adaptive filtering, and parameter estimation) are merely static stochastic optimization problems. Furthermore, we have demonstrated the use of sophisticated new algorithms such as stochastic discrete optimization, partially observed Markov decision processes (POMDPs), bandit processes, multiparticle Brownian dynamics simulations, and gradient estimation-based stochastic approximation algorithms. These novel methods provide a powerful set of algorithms that will supersede conventional signal-processing tools such as elementary stochastic approximation algorithms (e.g., the LMS algorithm), subspace methods, etc. In current work we are examining several extensions of the ideas in this chapter, including estimating the shape of ion channels. Finally, it is hoped that this chapter will motivate more researchers in the areas of statistical signal processing and stochastic control to apply their expertise to the exciting area of ion channels.

Acknowledgments

The author thanks his collaborator Dr. Shin-Ho Chung of the Australian National University, Canberra, for several useful discussions and for preparing fig. 4.4. This chapter is dedicated to the memory of the author's mother, Mrs. Bani Krishnamurthy, who passed away on November 26, 2004.

5

Spin Diﬀusion: A New Perspective in Magnetic Resonance Imaging

Timothy R. Field

5.1

Context

In this chapter we outline some emerging ideas relating to the detection of spin populations arising in magnetic resonance imaging (MRI). MRI techniques are already at a mature stage of development, widely used as a research tool and practised in clinical medicine, and provide the primary noninvasive method for studying internal brain structure and activity, with excellent spatial resolution. Much attention is typically paid to detecting spatial anomalies in brain tissue, e.g., brain tumors, and to localizing certain areas of the brain that correspond to particular stimuli, e.g., within the motor, visual, and auditory cortices. More recently, functional magnetic resonance imaging (fMRI) techniques have been successfully applied in psychological studies analyzing the temporal response of the brain to simple known stimuli. The possibility of enhancing these techniques to deal with more sophisticated neurological processes, characterized by physiological changes occurring on shorter timescales, provides the motivation for developing real-time imaging techniques, where there is much insight to be gained, e.g., in tracking the auditory system. The concept of MRI is straightforward and can be briefly summarized as follows (e.g., Brown and Semelka, 2003). The physical phenomenon of magnetic resonance is due to the Zeeman effect, which accounts for the splitting of energy levels in atomic nuclei due to an applied magnetic field. In the presence of a background magnetic field B_0, the majority of protons (hydrogen nuclei) N_+ in a material tend toward their minimum energy spin configuration, with their spin vectors aligned with B_0, in the "spin-up" state (and thus in the minimum energy eigenstate of the quantum mechanical spin Hamiltonian). A smaller number N_− take up the excited state with spin antiparallel to B_0, the "spin-down" state.
According to statistical mechanics arguments, the ratio of the spin-up to spin-down populations N+ /N− is governed by the Bose-Einstein distribution. Thus, the majority of protons are able to absorb available energy and make a transition to an excited state, provided the energy is applied in a way that matches the resonance properties of


the protons in the material. The details of this energy absorption stem from the design of the MR experiment. A pulse of energy is applied to the material inside the background ﬁeld, which is absorbed (deterministically) and subsequently radiated away in a random-type fashion. Although the intrinsic nuclear atomic energies are very large in comparison, the diﬀerences between the spin-up/down energy levels, as predicted by the Zeeman interaction, lie in the radio frequency (RF) band of the electromagnetic spectrum. It is these energy diﬀerences (quantum gaps, if you like) that give rise to the received signal in MR experiments, through subsequent radiation of the absorbed energy. Although it is a quantum eﬀect, many ideas from classical physics can be drawn into play in terms of a qualitative understanding of the origin of the MR signal. Conservation of energy is a key component, as also are the precessional dynamics of the net magnetization and Faraday’s law of electromagnetic induction, described in section 5.2. It turns out that molecular environment aﬀects the precise values of splitting in proton spin energy levels, so that, e.g., a proton in a fat molecule has a diﬀerent absorption spectrum from a proton in water. The determination of the values of these “chemical shifts” is the basis of magnetic resonance spectroscopy (MRS). Once these “ﬁngerprint” resonance absorption values are known, one can design spectrally selective RF pulses in MR experiments so that only protons in certain chemical environments, with magnetic resonance absorption properties matching the frequency (photon energy) spectrum of the pulse, are excited into states of higher energy. The point of view taken here is that the spin population itself is the object of primary signiﬁcance in constructing an image, especially when it comes to the study of neural brain activity. 
The reasons for this emphasis on the spin population size, as opposed to certain notions of its time-average behavior, are twofold. First, it is the spin density of protons, in a given molecular environment, that is the fundamental object of the detection. Second, the spin population is something that can in principle be studied in real time. In contrast, the more familiar techniques known as T1- and T2-weighted imaging involve a statistical time average, measuring the respective "spin-lattice" and "spin-spin" relaxation times: T1 is the average restoration time for the longitudinal component and T2 the average decay time for the transverse component of the local magnetic field in the medium. Thereby, information concerning the short-timescale properties of the population is necessarily lost. It is argued here that one can in principle infer the (local) size of a resonant spin population, from a large population whose dynamics is arbitrary, through a novel type of signal-processing technique which exploits certain ingredients in the physics of the spin dephasing process occurring in T2 relaxation. The emphasis in this chapter is on the ideas and novel concepts, their significance in drawing together ideas from physics and signal processing, and the implications and new perspectives they provide in the context of magnetic resonance imaging. The detailed underlying mathematics, although an essential part of the sustained theoretical development, is highly technical and so is reported separately in Field (2005). In section 5.2 we give the background to the description of an electromagnetic field interacting with a random population, in terms of


the mathematics of stochastic diﬀerential equations (SDEs). Section 5.3 illustrates how this dynamics can be applied to an arbitrary spin population, in the context of constructing an image in magnetic resonance experiments. The possible implications of these ideas for future MRI systems are described in section 5.4, where we identify certain domains of validity of the proposed model and the appropriate corresponding choice of some design parameters that would be necessary for successful implementation. We provide, without proof, two key mathematical results behind this line of development, concerning the dynamics of spin-spin relaxation in proposition 5.1 and the observability of the spin population through statistical analysis of the phase ﬂuctuations in theorem 5.1. The reader is referred to Field (2005) for their detailed mathematical derivation.

5.2

Conceptual Framework

Our purpose is to identify the dynamics of spin-spin (T2) relaxation using a geometrical description of the transverse spin population, and the mathematics of SDEs (e.g., Oksendal, 1998) to derive the continuous-time statistical properties of the net transverse magnetization. In doing so we are led to an exact expression for the "hidden" state (the spin population level) in terms of the additional phase degrees of freedom in the MR signal, described in section 5.3. In our discussion we shall not confine ourselves to any specific choice of population model, and thus encompass the possibility of describing the highly nonstationary behavior characteristic of brain signals that encode information in real time in response to stimuli, such as occur in the auditory system. Let us assume that the RF pulse is applied at a "pulse flip" angle of 90° to the longitudinal B_0 direction. As a result, RF energy ΔE is absorbed, and the net local magnetization is rotated into the transverse plane. Each component spin vector then rotates about the longitudinal axis at (approximately) the Larmor precessional frequency ω_0, governed by the following relations:

\[
\Delta E = \hbar \omega_0 = \frac{h \gamma B_0}{2\pi},
\tag{5.1}
\]

where γ is the gyromagnetic ratio (which varies throughout space depending on the details of the molecular environment). The resulting motion of the net local magnetization vector Mt can be understood by analogy with the Eulerian top in classical mechanics. As time progresses following the pulse, energy is transferred from the proton spins to the surroundings during the process of “spin-lattice” relaxation, and the longitudinal component of the net magnetization is gradually restored to the equilibrium value prior to the pulse being applied. Likewise, random exchange of energy between neighboring spins and small inhomogeneities in the total magnetic ﬁeld cause perturbations in the phases of the transverse spin components and “dephasing” occurs, so that the net transverse component of magnetization decays to zero. The motion of Mt can thus be visualized as a


Figure 5.1 Geometry of transverse spin population: each point represents a constituent proton, with respect to which each connecting vector (in the direction of the random walk away from the origin) represents the transverse component of the associated spin vector.

precession about the longitudinal axis, over the surface of a cone whose opening angle tends from π to 0 as equilibrium is restored. It is convenient for visualization purposes to work in a rotating frame of reference, rotating at the Larmor frequency ω_0 about the longitudinal axis. (It is worth remarking that in the corresponding situation for radar scattering, ω_0 is the Doppler frequency arising from bulk wave motion in the scattering surface.) This brings each transverse spin vector to rest, for a perfectly homogeneous B_0. Nevertheless it is the local inhomogeneities in the total (background plus internal) magnetic field, due to the local magnetic properties of the medium, that give rise to spin-spin (or T2) relaxation, constituting the (random) exchange of energy between spins. The local perturbations in the total magnetic field can reasonably be considered as independent for each component spin, so that the resultant (transverse) spin vector can be modeled as a sum of independent random phasors. Thus, the pertinent expression for the resultant net transverse magnetization is

\[
M_t^{(N_t)} = \sum_{j=1}^{N_t} s^{(j)} = \sum_{j=1}^{N_t} a_j \exp\left(i \varphi_t^{(j)}\right),
\tag{5.2}
\]


with (fluctuating) spin population size N_t, random phasor steps s^{(j)}, and component amplitudes a_j. Observe that this random walk type model is directly analogous to what has been used in the dynamical description of Rayleigh scattering (Field and Tough, 2003b), introduced in Jakeman (1980). Thus, the geometrical spin structure, transverse to the background longitudinal magnetic field, lies in correspondence with the plane-polarized (perpendicular to the propagation vector) components of the electromagnetic field arising in (radar) scattering and (optical) propagation (Field and Tough, 2003a), where the same types of ideas, albeit for specific types of (stationary) populations, have been experimentally verified (Field and Tough, 2003b, 2005). The geometry of fig. 5.1 illustrates the isomorphism between the transverse spin structure of atomic nuclei and photon spin for a plane-polarized state, the latter familiar from radio theory as the (complex-valued) electromagnetic amplitude perpendicular to the direction of propagation. Indeed, this duality between photon spin (EM wave polarization) and nuclear spin is a key conceptual ingredient in this development. The dynamical structure of equation 5.2 is supplied by a (phase) diffusion model (Field and Tough, 2003b) which takes the component phases \(\{\varphi_t^{(j)}\}\) to be a collection of (displaced) Wiener processes evolving on a suitable timescale. Thus \(\varphi_t^{(j)} = \Delta^{(j)} + B^{1/2} W_t^{(j)}\), with initialization Δ^{(j)}. In magnetic resonance the 90° pulse explained above causes the spin phasors s^{(j)} to be aligned initially; thus the Δ^{(j)} are identical for all j. Let T be the phase decoherence or spin-spin relaxation time (T2), such that \(\{\varphi_t^{(j)} \mid t \ge T\}\) have negligible correlation. Then for t ≥ T, defining the (normalized) resultant by \(m_t = \lim_{N\to\infty} M_t^{(N)}/N^{1/2}\), we obtain the resultant spin dynamics or net magnetization in the transverse plane (cf. Field, 2005).
Proposition 5.1 For sufficiently large times t ≥ T the spin dynamics, for a constant population, is given by the complex Ornstein-Uhlenbeck equation

\[
dm_t = -\tfrac{1}{2} B m_t \, dt + B^{1/2} \left(a^2\right)^{1/2} d\xi_t,
\tag{5.3}
\]

where ξ_t is a complex-valued Wiener process (satisfying |dξ_t|² = dt, dξ_t² = 0). As the collection of spins radiates the absorbed energy, this gives rise to the received MR signal ψ_t (the free induction decay or FID), which is detected through the generation of electromotive force in a coil apparatus, due to the time-varying local magnetic field. This effect is the result of Faraday's law, i.e., Maxwell's vector equation ∇ × E = −∂B/∂t integrated around a current loop. The receiver coil is placed perpendicular to the transverse plane, and so only the transverse components of the change in magnetic flux contribute to the FID. The MR signal thus corresponds to an amplitude process that represents the (time derivative of the) net transverse magnetization, and has the usual in-phase (I) and quadrature-phase (Q) components familiar from radio theory, so ψ = I + iQ. Moreover it can be spatially localized using standard gradient field techniques (e.g., Brown and Semelka, 2003). The constant B in equation 5.3, which has dimensions of frequency, is (proportional to) the reciprocal of the spin-spin relaxation time T2.
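The dephasing mechanism behind proposition 5.1 is easy to reproduce numerically. The sketch below is ours (illustrative names; unit amplitudes a_j = 1, and the resultant normalized by N rather than N^{1/2} so it can be compared directly with the coherent decay factor): the phases diffuse as \(\varphi_t^{(j)} = B^{1/2} W_t^{(j)}\) from an aligned start, and since E exp(i B^{1/2} W_t) = exp(−Bt/2), the magnitude of the average phasor decays at the T2 rate set by B.

```python
import numpy as np

def average_phasor(n_spins=20000, B=2.0, T=1.0, n_steps=200, seed=0):
    """Random-walk phasor model of eq. 5.2 with a_j = 1: phases start
    aligned (the 90-degree pulse) and diffuse as sqrt(B) W_t.  Returns
    the average phasor (1/N) sum_j exp(i phi_t^(j)) on a time grid."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    phases = np.zeros(n_spins)
    out = np.empty(n_steps + 1, dtype=complex)
    out[0] = np.exp(1j * phases).mean()
    for k in range(1, n_steps + 1):
        # independent Wiener increments for each component spin
        phases += np.sqrt(B * dt) * rng.standard_normal(n_spins)
        out[k] = np.exp(1j * phases).mean()
    return out
```

With B = 2 the coherent part at t = 1 is close to e^{−1}; what remains after decoherence is the O(N^{−1/2}) incoherent resultant described by the Ornstein-Uhlenbeck limit.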


In the case that the population size fluctuates over the timescale of interest, we introduce the continuum population variate x_t, and the receiver amplitude then has the compound representation

\[
\psi_t = x_t^{1/2} m_t,
\tag{5.4}
\]

in which x_t and m_t are independent processes.

5.3

Image Extraction

An essential desired ingredient is the ability to handle real-time nonstationary behavior in the spin population. Moreover we do not wish to make any prior assumptions concerning the statistical behavior of this population, for it is this unknown state that we are trying to estimate from the observed MR signal. Indeed, its value depends on the external stimuli in a way that is not well understood, and it is precisely our purpose to uncover the nature of this dependence by processing the observable data in an intelligent way. But in doing so, we do not wish to prejudice our notions of spin population behavior, beyond some very generic assumptions set in place for the purpose of mathematical tractability. Accordingly we shall assume that the population process x_t is an Ito process, i.e., that it satisfies the SDE (e.g., Oksendal, 1998)

\[
dx_t = A b_t \, dt + \left(2 A \Sigma_t\right)^{1/2} dW_t^{(x)},
\tag{5.5}
\]

in which the respective drift and diffusion parameters b_t, Σ_t are (real-valued) stochastic processes (not necessarily Ito), and in general include the effects of nonstationary and non-Gaussian behavior. Thus, we make no prior assumption concerning the nature of these parameters, and wish to estimate the values of {x_t} from our observations of the MR signal. The SDE for the resultant phase can be derived from equations 5.3, 5.4, and 5.5. Intriguingly, the behavior for a general population is functionally independent of the parameters b_t and Σ_t, the effect of these parameters coming through in the resulting evolutionary structure of the processes x_t, ψ_t. Calculation of the resulting squared phase fluctuations leads to the key noteworthy result of this chapter.

Theorem 5.1 The spin population is observable through the intensity-weighted squared phase fluctuations of the (FID) signal according to

\[
x_t = \frac{2}{B} \, z_t \, \frac{(d\theta_t)^2}{dt}
\tag{5.6}
\]

throughout space and time, where z_t = |ψ_t|² and the field m_t is scaled so that a² = 1.


The significance of this result is that the relation 5.6 is exact, instantaneous in nature, and moreover independent of the dynamics of the spin population. It is straightforward to illustrate this result with data sampled in discrete time (Field, 2005). The result approaches exactness as the pulse repetition rate tends to infinity. More precisely, for discretely sampled data theorem 5.1 implies that

\[
z_i \, \delta\theta_i^2 \propto x_i n_i^2,
\tag{5.7}
\]

where i is a discrete time index and {n_i} are an independent collection of N(0, 1) distributed random variables. This can be used to estimate the state x_t via local time averaging. Applying a smoothing operation ⟨·⟩_Δ to the left-hand side (the "observations") of 5.7, over a window [t_0 − Δ, t_0 + Δ], yields an approximation to x_{t_0}, with an error that tends to zero as the number of pulses inside the window tends to infinity and Δ → 0 (Field, 2005). In this respect we observe as a consequence that in order to achieve an improved signal-to-noise ratio (SNR), it is sufficient merely to increase the pulse rate, without (necessarily) requiring a high-amplitude signal. The term SNR is used here in the sense of extracting the signal x_t from ψ_t, thus overcoming the noise in m_t (cf. eq. 5.4). This inference capability has been demonstrated in analysis of synthetic data, with population statistics chosen deliberately not to conform to the types of model usually encountered (Field, 2005). Instead of "filtering out" the noise to obtain the signal, we have exploited useful information contained in the random fluctuations in the phase. Indeed the phase noise is so strongly colored that, provided it is sampled at high enough frequency, it enables us to extract the precise values of the underlying state, in this case the population size of spins that have absorbed the RF energy. A comparison of time series of the state inferred from the observations alone, with the exact values recorded in generation of the synthetic data, shows a very high degree of correlation (Field, 2005).
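A minimal synthetic-data check of theorem 5.1 and relation 5.7 can be run by simulating the model of eqs. 5.3 and 5.4 and recovering the population level by local time averaging. This is our own sketch (Euler-Maruyama discretization, illustrative parameter and window choices), not Field's implementation:

```python
import numpy as np

def simulate_fid(x, B=4.0, dt=1e-3, seed=0):
    """Euler-Maruyama simulation of the OU field m_t of eq. 5.3 (with
    a^2 = 1), then psi_t = sqrt(x_t) m_t as in eq. 5.4.  `x` is an array
    of population levels, one per time step."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # complex Wiener increments with |d xi|^2 = dt (each component var dt/2)
    dxi = np.sqrt(dt / 2.0) * (rng.standard_normal(n - 1)
                               + 1j * rng.standard_normal(n - 1))
    m = np.empty(n, dtype=complex)
    m[0] = 1.0
    for k in range(n - 1):
        m[k + 1] = m[k] - 0.5 * B * m[k] * dt + np.sqrt(B) * dxi[k]
    return np.sqrt(x) * m

def estimate_population(psi, B=4.0, dt=1e-3, window=2000):
    """Estimate x_t via eqs. 5.6/5.7: smooth the intensity-weighted
    squared phase increments (2/B) z_i (delta theta_i)^2 / dt over a
    moving window (the local time average <.>_Delta)."""
    z = np.abs(psi[:-1]) ** 2
    dtheta = np.angle(psi[1:] / psi[:-1])  # wrapped phase increments
    obs = (2.0 / B) * z * dtheta ** 2 / dt
    return np.convolve(obs, np.ones(window) / window, mode="valid")
```

For a constant population x_t = 2 the smoothed observations fluctuate around 2, and the residual error shrinks as the sampling rate (pulse rate) and the number of samples per window grow, as the theorem predicts.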

5.4

Future Systems

For MR imaging purposes, we have proposed focusing on the local size of the transverse spin population that results from the applied RF pulse, which is assumed to be spectrally selective. Our results demonstrate, at a theoretical/simulation level, how this "signal" can be extracted through statistical analysis of the (phase) fluctuations in the received (complex) amplitude signal. In the MR context, the spin population, which assumes a Bose-Einstein distribution in equilibrium, responds in a dynamical nonstationary fashion to applied RF radiation, and our results suggest means for detecting this dynamical behavior using a combination of physical modeling and novel statistical signal-processing techniques based on the mathematics of SDEs. The idea of focusing attention on the spin population size appears to be more intuitive than measurements of the single-point statistic T2, the average transverse


decay time, that is commonly used in T2-weighted imaging (Brown and Semelka, 2003). Primarily one is concerned here with estimating the population of spins that absorb energy from the RF pulse at different locations, since this yields the spin density of (protons in) the molecular environments of interest whose energy resonances (predicted through the Zeeman interaction) match those of the designed pulse. Our discussion demonstrates how the error in the estimate of a random population interacting with the electromagnetic field can be reduced to an arbitrarily small amount. In the MR context this suggests that a moderate/low background field strength B_0 and short pulse repetition time TR could be sufficient for generating real-time images. A complication posed by the high specific absorption rate SAR ∝ B_0²/TR (which measures the deposition of energy in the medium in the form of heat) for short TR can presumably be overcome by using short-duration bursts of RF energy (just as in radar systems) to detect short-timescale properties. In summary, the results suggest means for real-time image construction in fMRI experiments at moderate to low magnetic field strength, and identify the choice of parameters necessary in the design of fMRI systems for the technique to be valid. There are indications that, for appropriate parameter values, the technique should succeed in extracting the real-time spin population behavior in the context of MR, without prior assumptions concerning the nature of this population needing to be made. Indeed, at the level of simulated data, this result has been verified (see fig. 2 in Field, 2005). The ability to track the spin population from the observed amplitude in local time suggests the possibility of detecting spatiotemporal changes in neural activity in the brain, e.g., in the localization and tracking of evoked responses.
Acknowledgments

The author is grateful to John Bienenstock and Michael Noseworthy of the Brain Body Institute, Simon Haykin, and José Príncipe.

6

What Makes a Dynamical System Computationally Powerful?

Robert Legenstein and Wolfgang Maass

We review methods for estimating the computational capability of a complex dynamical system. The main examples that we discuss are models for cortical neural microcircuits with varying degrees of biological accuracy, in the context of online computations on complex input streams. In particular, we address the question of to what extent earlier results relating the edge of chaos to the computational power of discrete-time dynamical systems for off-line computing also apply to this case.

6.1

Introduction

Most work in the theory of computations in circuits focuses on computations in feedforward circuits, probably because computations in feedforward circuits are much easier to analyze. But biological neural circuits are obviously recurrent; in fact the existence of feedback connections on several spatial scales is a characteristic property of the brain. Therefore an alternative computational theory had to be developed for this case. One neuroscientist who emphasized the need to analyze information processing in the brain in the context of dynamical systems theory was Walter Freeman, who started to write a number of influential papers on this topic in the 1960s; see Freeman (2000, 1975) for references and more recent accounts. The theoretical investigation of computational properties of recurrent neural circuits started shortly afterward. Earlier work focused on the engraving of attractors into such systems in order to restrict the dynamics to achieve well-defined properties. One stream of work in this direction (see, e.g., Amari, 1972; Cowan, 1968; Grossberg, 1967; Little, 1974) culminated in the influential studies of Hopfield regarding networks with stable memory, called Hopfield networks (Hopfield, 1982, 1984), and the work of Hopfield and Tank on networks which are able to find approximate solutions of hard combinatorial problems like the traveling salesman


problem (Hopfield and Tank, 1985, 1986). The Hopfield network is a fully connected neural network of threshold or threshold-like elements. Such networks exhibit rich dynamics and are chaotic in general. However, Hopfield assumed symmetric weights, which strongly constrains the dynamics of the system. Specifically, one can show that only point attractors can emerge in the dynamics of the system, i.e., the activity of the elements always evolves to one of a set of stable states, which is then kept forever. Somewhat later the alternative idea arose to use the rich dynamics of neural systems that can be observed in cortical circuits rather than to restrict them (Buonomano and Merzenich, 1995). In addition, it was realized that one needs to look at online computations (rather than off-line or batch computing) in dynamical systems in order to capture the biologically relevant case (see Maass and Markram, 2005, for definitions of such basic concepts of computation theory). These efforts resulted in the "liquid state machine" model by Maass et al. (2002) and the "echo state network" by Jaeger (2002), which were introduced independently. The basic idea of these models is to use a recurrent network to hold and nonlinearly transform information about the past input stream in the high-dimensional transient state of the network. This information can then be used to produce in real time various desired online outputs by simple linear readout elements. These readouts can be trained to recognize common information in dynamically changing network states because of the high dimensionality of these states. It has been shown that these models exhibit high computational power (Jaeger and Haas, 2004; Joshi and Maass, 2005; Legenstein et al., 2003). However, the analytical study of such networks with rich dynamics is a hard task.
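The effect of Hopfield's symmetry assumption can be demonstrated in a toy sketch (ours, with illustrative names and sizes): with Hebbian weights W = Wᵀ and zero diagonal, asynchronous threshold updates never increase the energy −½ sᵀWs, so every trajectory settles into a point attractor, and a corrupted cue is restored to the nearby stored pattern.

```python
import numpy as np

def hopfield_weights(patterns):
    """Hebbian weight matrix for +/-1 patterns; the symmetry W = W^T
    (with zero diagonal) is what restricts the dynamics to point
    attractors."""
    p = np.asarray(patterns, dtype=float)
    W = p.T @ p / p.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, cue, n_sweeps=10, seed=0):
    """Asynchronous +/-1 threshold updates in random order; the energy
    -0.5 s^T W s is non-increasing, so the state converges to a stable
    fixed point of the dynamics."""
    rng = np.random.default_rng(seed)
    s = np.array(cue, dtype=float)
    for _ in range(n_sweeps):
        for i in rng.permutation(len(s)):
            s[i] = 1.0 if W[i] @ s >= 0.0 else -1.0
    return s
```

Storing two random 64-bit patterns and flipping a few bits of one of them, the update dynamics pulls the state back to the stored pattern, the behavior that the later "liquid" approaches deliberately move away from.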
Fortunately, there exists a vast body of literature on related questions in the context of many diﬀerent scientiﬁc disciplines in the more general framework of dynamical systems theory. Speciﬁcally, a stream of research is concerned with system dynamics located at the boundary region between ordered and chaotic behavior, which was termed the “edge of chaos.” This research is of special interest for the study of neural systems because it was shown that the behavior of dynamical systems is most interesting in this region. Furthermore, a link between computational power of dynamical systems and the edge of chaos was conjectured. It is therefore a promising goal to use concepts and methods from dynamical systems theory to analyze neural circuits with rich dynamics and to get in this way better tools for understanding computation in the brain. In this chapter, we will take a tour, visiting research concerned with the edge of chaos and eventually arrive at a ﬁrst step toward this goal. The aim of this chapter is to guide the reader through a stream of ideas which we believe are inspiring for research in neuroscience and molecular biology, as well as for the design of novel computational devices in engineering. After a brief introduction of fundamental principles of dynamical systems theory and chaos in section 6.2, we will start our journey in section 6.3 in the ﬁeld of theoretical biology. There, Kauﬀman studied questions of evolution and emerging order in organisms. We will see that depending on the connectivity structure,

Table 6.1 General Properties of Various Types of Dynamical Systems

                        Cellular    Iterative   Boolean     Cortical Microcircuits and
                        Automata    Maps        Circuits    Gene Regulation Networks
 Analog states?         no          yes         no          yes
 Continuous time?       no          no          no          yes
 High dimensional?      yes         no          yes         yes
 With noise?            no          no          no          yes
 With online input?     no          no          usually no  yes
networks may operate either in an ordered or chaotic regime. Furthermore, we will encounter the edge of chaos as a transition between these dynamic regimes. In section 6.4, our tour will visit the field of statistical physics, where Derrida and others studied related questions and provided new methods for their mathematical analysis. In section 6.5 the reader will see how these ideas can be applied in the theory of computation. The study of cellular automata by Wolfram, Langton, Packard, and others led to the conjecture that complex computations are best performed in systems at the edge of chaos. The next stops of our journey in sections 6.6 and 6.7 will bring us close to our goal. We will review work by Bertschinger and Natschläger, who analyzed real-time computations on the edge of chaos in threshold circuits. In section 6.8, we will briefly examine self-organized criticality, i.e., how a system can adapt its own dynamics toward the edge of chaos. Finally, section 6.9 presents the efforts of the authors of this chapter to apply these ideas to computational questions in the context of biologically realistic neural microcircuit models. In this section we will analyze the edge of chaos in networks of spiking neurons and ask the following question: In what dynamical regimes are neural microcircuits computationally most powerful? Table 6.1 shows that neural microcircuits (as well as gene regulation networks) differ in several essential aspects from those examples of dynamical systems that are commonly studied in dynamical systems theory.
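Kauffman's ordered and chaotic regimes, and the Derrida-style damage-spreading analysis we will meet in section 6.4, can be previewed with a short simulation of random Boolean networks (a sketch of ours; sizes and parameters are illustrative). Each node computes a random Boolean function of K random inputs; one measures how far a single flipped bit spreads after many synchronous updates.

```python
import numpy as np

def random_boolean_network(n, k, rng):
    """Kauffman-style random Boolean network: each node gets k random
    inputs and a random Boolean function (truth table over 2^k patterns)."""
    inputs = np.array([rng.choice(n, size=k, replace=False) for _ in range(n)])
    tables = rng.integers(0, 2, size=(n, 2 ** k))
    return inputs, tables

def step(state, inputs, tables):
    # index each node's truth table by the binary pattern of its inputs
    powers = 2 ** np.arange(inputs.shape[1])
    idx = (state[inputs] * powers).sum(axis=1)
    return tables[np.arange(len(state)), idx]

def damage(n=500, k=2, t=50, trials=10, seed=0):
    """Average normalized Hamming distance after t synchronous steps
    between a random state and a copy with one flipped bit."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        inputs, tables = random_boolean_network(n, k, rng)
        a = rng.integers(0, 2, size=n)
        b = a.copy()
        b[0] ^= 1  # single-bit perturbation
        for _ in range(t):
            a = step(a, inputs, tables)
            b = step(b, inputs, tables)
        total += float(np.mean(a != b))
    return total / trials
```

For K = 1 the perturbation typically dies out (ordered regime), while for K = 4 it spreads to a finite fraction of the network (chaotic regime); the transition between the two as connectivity varies is the edge of chaos discussed below.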

6.2 Chaos in Dynamical Systems

In this section we briefly introduce ideas from dynamical systems theory and chaos. A few slightly different definitions of chaos are given in the literature. Although we will mostly deal here with systems in discrete time and discrete state space, we start out with the well-established definition of chaos in continuous systems and return to discrete systems later in this section. The subject known as dynamics deals with systems that evolve in time (Strogatz, 1994). The system in question may settle down to an equilibrium, may enter a periodic trajectory (limit cycle), or may do something more complicated. In Kaplan
and Glass (1995) the dynamics of a deterministic system is defined as chaotic if it is aperiodic and bounded, with sensitive dependence on initial conditions. The phase space of an N-dimensional system is the space with coordinates x1, . . . , xN. The state of an N-dimensional dynamical system at time t is represented by the state vector x(t) = (x1(t), . . . , xN(t)). If a system starts in some initial condition x(0), it will evolve according to its dynamics and describe a trajectory in state space. A steady state of the system is a state xs such that if the system evolves with xs as its initial state, it will remain in this state for all future times. Steady states may or may not be stable to small outside perturbations. For a stable steady state, small perturbations die out and the trajectory converges back to the steady state. For an unstable steady state, trajectories do not converge back to the steady state after arbitrarily small perturbations. In general, an attractor is a set of points or states in state space to which trajectories within some volume of state space converge asymptotically over time. This set itself is invariant under the dynamic evolution of the system.1 A stable steady state is therefore a zero-dimensional, or point, attractor. The set of initial conditions that evolve to an attractor A is called the basin of attraction of A. A limit cycle is an isolated closed trajectory; isolated means that neighboring trajectories are not closed. If released at some point of the limit cycle, the system flows around the cycle repeatedly. The limit cycle is stable if all neighboring trajectories approach the limit cycle. A stable limit cycle is a simple type of attractor. Higher-dimensional and more complex types of attractors exist as well. In addition, there exist so-called strange, or chaotic, attractors. For example, all trajectories in a high-dimensional state space might be brought onto the two-dimensional surface of some manifold.
The interesting property of such attractors is that if the system is released in two different experiments from two points on the attractor which are arbitrarily close to each other, the subsequent trajectories remain on the attractor surface but diverge away from each other. After a sufficient time, the two trajectories can be arbitrarily far apart. This extreme sensitivity to initial conditions is characteristic of chaotic behavior. In fact, exponentially fast divergence of trajectories (characterized by a positive Lyapunov exponent) is often used as a definition of chaotic dynamics (see, e.g., Kantz and Schreiber, 1997). There might be a lot of structure in chaotic dynamics, since the trajectory of a high-dimensional system might be projected onto a mere two-dimensional surface. However, since the trajectory on the attractor is chaotic, the exact trajectory is practically unpredictable (even if the system is deterministic). Systems in discrete time and with a finite discrete state space differ from continuous systems in several respects. First, since state variables are discrete, trajectories can merge, whereas in a continuous system they can merely approach each other. Second, since there is a finite number of states, the system must eventually reenter a state previously encountered and will thereafter cycle repeatedly through this state cycle. These state cycles are the dynamical attractors of the discrete system. The set of states flowing into such a state cycle or lying on it constitutes the basin of attraction of that state cycle. The length of a state cycle is the number
of states on the cycle. For example, the memory states in a Hopﬁeld network (a network of artiﬁcial neurons with symmetric weights) are the stable states of the system. A Hopﬁeld network does not have state cycles of length larger than one. The basins of attraction of memory states are used to drive the system from related initial conditions to the same memory state, hence constituting an associative memory device. Characteristic properties of chaotic behavior in discrete systems are a large length of state cycles and high sensitivity to initial conditions. Ordered networks have short state cycles and their sensitivity to initial conditions is low, i.e., the state cycles are quite stable. We note that state cycles can be stable with respect to some small perturbations but unstable to others. Therefore, “quite stable” means in this context that the state cycle is stable to a high percentage of small perturbations. These general deﬁnitions are not very precise and will be made more speciﬁc for each of the subsequent concrete examples.
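The associative recall mechanism just described can be illustrated with a minimal Hopfield-style sketch (the network size, the number of corrupted entries, and the random seed below are our own illustrative choices, not taken from the text):

```python
import numpy as np

def hopfield_recall(pattern, probe, steps=10):
    """Recall a stored pattern from a corrupted probe.

    Weights are the outer product of the stored pattern (zero diagonal),
    so the pattern is a stable state whose basin of attraction pulls
    nearby states back to it.
    """
    n = len(pattern)
    W = np.outer(pattern, pattern) / n
    np.fill_diagonal(W, 0.0)
    x = probe.copy()
    for _ in range(steps):
        x = np.where(W @ x >= 0, 1, -1)  # synchronous threshold update
    return x

# store a random +-1 pattern and corrupt 10 of its 100 entries
rng = np.random.default_rng(0)
p = np.where(rng.random(100) < 0.5, 1, -1)
probe = p.copy()
probe[:10] *= -1

recalled = hopfield_recall(p, probe)
```

Starting anywhere inside the basin of attraction of the stored pattern, the updates pull the state back onto the point attractor; here the probe with 10 flipped entries is restored exactly.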

6.3 Randomly Connected Boolean Networks

The study of complex systems is obviously important in many scientific areas. In genetic regulatory networks, for example, thousands or millions of coupled variables orchestrate the developmental programs of an organism. In 1969, Kauffman started to study such systems in the simplified model of Boolean networks (Kauffman, 1969, 1993). He discovered some surprising results which will be discussed in this section. We will encounter systems in the ordered and in the chaotic phase. The specific phase depends on a simple structural feature of the system, and a phase transition occurs when this feature changes. A Boolean network consists of N elements and connections between them. The state of its elements is described by binary variables x1, . . . , xN. The dynamical behavior of each variable, i.e., whether it will be active (1) or inactive (−1) at the next time step, is governed by a Boolean function.2 The (directed) connections between the elements describe possible interactions. If there is a connection from element i to element j, then the state of element i influences the state of element j in the next time step. We say that i is an input of element j. An initial condition is given by a value xi(0) for each variable. Thereafter, the state of each element evolves according to the Boolean function assigned to it. We can describe the dynamics of the system by a set of iterated maps

x1(t + 1) = f1(x1(t), . . . , xN(t))
        ...
xN(t + 1) = fN(x1(t), . . . , xN(t)),

where f1, . . . , fN are Boolean functions.3 Here, all state variables are updated in parallel at each time step. The stability of attractors in Boolean networks can be studied with respect to
minimal perturbations. A minimal perturbation is simply the flip of the activity of a single variable to the opposite state. Kauffman studied the dynamics of Boolean networks as a function of the number N of elements in the network and the average number K of inputs to each element. Since he was not interested in the behavior of particular nets but rather in the expected dynamics of nets with given N and K, he sampled at random from the ensemble of all such networks. Thus the K inputs to each element were first chosen at random and then fixed, and the Boolean function assigned to each element was also chosen at random and then fixed. For each such member of the ensemble, Kauffman performed computer simulations and examined the accumulated statistics. The case K = N is especially easy to analyze. Since the Boolean function of each element is chosen randomly from a uniform distribution, the successor of each circuit state is drawn randomly from a uniform distribution over the 2^N possible states. This leads to long state cycles: the median state cycle length is 0.5 · 2^(N/2). Kauffman called such exponentially long state cycles chaotic.4 These state cycles are unstable to most perturbations, hence there is a strong dependence on initial conditions. However, only a few different state cycles exist in this case: the expected number of state cycles is N/e. Therefore, there is some characteristic structure in the chaotic behavior, in the sense that the system will end up in one of only a few long-term behaviors. As long as K is not too small, say K ≥ 5, the main features of the case K = N persist. The dynamics is still governed by relatively few state cycles of exponential length, whose expected number is at most linear in N. For K ≥ 5, these results can be derived analytically by a rough mean field approximation. For smaller K, the approximation becomes inaccurate.
However, simulations confirm that exponential state cycle length and a linear number of state cycles are characteristic of random Boolean networks with K ≥ 3. Furthermore, these systems show high sensitivity to initial conditions (Kauffman, 1993). The case K = 2 is of special interest: there, a phase transition from ordered to chaotic dynamics occurs. Numerical simulations have revealed the following characteristic features of random Boolean networks with K = 2 (Kauffman, 1969). The expected median state cycle length is about √N. Thus, random Boolean networks with K = 2 often confine their dynamical behavior to tiny subvolumes of their state space, a strong sign of order. A more detailed analysis shows that most state cycles are short, whereas there are a few long ones. The number of state cycles is about √N, and the cycles are inherently stable to about 80% to 90% of all minimal transient perturbations. Hence, the state cycles of the system have large basins of attraction and the sensitivity to initial conditions is low. In addition to these characteristics, which stand in stark contrast to networks of larger K, we want to emphasize three further features. First, typically at least 70% of the N elements have some fixed active or inactive state which is identical for all the existing state cycles of the Boolean network. This behavior establishes a frozen core of elements. The frozen core creates walls of constancy which break the system into
functionally isolated islands of unfrozen elements. These islands are thus prevented from influencing one another. The boundary regime where the frozen core is just breaking up and interaction between the unfrozen islands becomes possible is the phase transition between order and chaos. Second, transiently altering the activity of a single element typically propagates, but alters the activity of only a small fraction of the elements in the system. And third, deleting any single element or altering its Boolean function typically causes only modest changes in state cycles and transients. The latter two points ensure that “damage” to the system remains small. We will further discuss this interesting case in the next section. Networks with K = 1 operate in an ordered regime and are of little interest for us here.
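Kauffman-style ensemble experiments of this kind are easy to reproduce. The following sketch (sizes, the seed, and the initial state are arbitrary illustrative choices) samples a quenched random Boolean network with N elements and K inputs per element, iterates the parallel update, and measures the length of the state cycle that is eventually reached:

```python
import random

def random_boolean_network(N, K, seed=0):
    """Sample a quenched random Boolean network: for each element,
    K fixed random inputs and a fixed random Boolean function
    (stored as a truth table over the 2**K input patterns)."""
    rng = random.Random(seed)
    inputs = [rng.sample(range(N), K) for _ in range(N)]
    tables = [[rng.choice([-1, 1]) for _ in range(2 ** K)] for _ in range(N)]
    return inputs, tables

def step(state, inputs, tables):
    """One parallel update x_i(t+1) = f_i(x(t))."""
    new = []
    for i in range(len(state)):
        # encode the K input values (+1/-1) as an index into the truth table
        idx = 0
        for j in inputs[i]:
            idx = (idx << 1) | (state[j] == 1)
        new.append(tables[i][idx])
    return tuple(new)

def cycle_length(state, inputs, tables):
    """Run until a state repeats; return the length of the state cycle."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, inputs, tables)
        t += 1
    return t - seen[state]

inputs, tables = random_boolean_network(N=12, K=2, seed=1)
L = cycle_length(tuple([1] * 12), inputs, tables)
```

Running this for many sampled networks and comparing, e.g., K = 2 against K = 5 reproduces the qualitative difference in typical cycle length discussed above.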

6.4 The Annealed Approximation by Derrida and Pomeau

The phase transition from order to chaos is of special interest. As we shall see in the sections below, there are reasons to believe that this dynamical regime is particularly well suited for computations. There have been several attempts to understand the emerging order in random Boolean networks. In this section, we review the approach of Derrida and Pomeau (1986). Their beautiful analysis gives an analytical answer to the question of where such a transition occurs. In the original model, the connectivity structure and the Boolean functions fi of the elements i were chosen randomly but were then fixed. The dynamics of the network evolved according to this fixed network. In this case the randomness is quenched, because the functions fi and the connectivity do not change with time. Derrida and Pomeau presented a simple annealed approximation to this model which explains why there is a critical value Kc of K at which the transition from order to chaos appears. This approximation also allows the calculation of many properties of the model. In contrast to the quenched model, the annealed approximation randomly reassigns the connectivity and the Boolean functions of the elements at each time step. Although this assumption is quite drastic, it turns out that the agreement with observations in simulations of the quenched model is surprisingly good. The benefits of the annealed model will become clear below. It was already pointed out that exponential state cycle length is an indicator of chaos. In the annealed approximation, however, there are no fixed state cycles because the network is changed at every time step. Therefore, the calculations are based on the dependence on initial conditions. Consider two network states C1, C2 ∈ {−1, 1}^N. We define the Hamming distance d(C1, C2) as the number of positions in which the two states differ.
The question is whether two randomly chosen different initial network states eventually converge to the same pattern of activity over time. Or, stated in other words: given an initial state C1 which leads to a state C1^(t) at time t, and a different initial state C2 which leads to a state C2^(t) at time t, will the Hamming distance d(C1^(t), C2^(t)) converge to zero for

Figure 6.1 Expected distance between two states at time t + 1 as a function of the state distance yt between two states at time t, based on the annealed approximation (curves shown for K = 5, K = 3, and K = 2). Points on the diagonal yt+1 = yt are fixed points of the map. The curves for K ≥ 3 all have fixed points at a state distance larger than zero. The curve for K = 2 stays close to the diagonal for small state distances but does not cross it. Hence, for K = 2, state distances converge to zero under iterated application of the map.

large t? Derrida and Pomeau found that this is indeed the case for K ≤ 2. For K ≥ 3, the trajectories will diverge. To be more precise, one wants to know the probability P1(m, n) that the distance between the states at time t = 1 is m, given that the distance d(C1, C2) at time t = 0 was n. More generally, one wants to estimate the probability Pt(m, n) that the network states C1^(t), C2^(t) obtained at time t are at distance m, given that d(C1, C2) = n at time t = 0. It now becomes apparent why the annealed approximation is useful: in the annealed approximation, the state transition probabilities at different time steps are independent, which is not the case in the quenched model. For large N, one can introduce the continuous variable x = n/N. Derrida and Pomeau (1986) show that P1^annealed(m, n) for the annealed network has a peak around a value m = N·y1, where y1 is given by

y1 = (1 − (1 − x)^K) / 2.    (6.1)

Similarly, the probability Pt^annealed(m, n) has a peak at m = N·yt, with yt given by

yt = (1 − (1 − yt−1)^K) / 2    (6.2)

for t > 1. The behavior of this iterated map can be visualized in the so-called Derrida plot; see fig. 6.1. The plot shows the state distance at time t + 1 as a
function of the state distance at time t. Points on the diagonal yt+1 = yt are fixed points of the map. For K ≤ 2, the fixed point y = 0 is the only fixed point of the map, and it is stable. In fact, for any starting value y1, we have yt → 0 as t → ∞ in the limit N → ∞. For K > 2, the fixed point y = 0 becomes unstable and a new stable fixed point y∗ appears. Therefore, the state distance need no longer converge to zero. Hence there is a phase transition of the system at K = 2. The theoretical work of Derrida and Pomeau was important because previously there was only empirical evidence for this phase transition. We conclude that there exists an interesting transition region from order to chaos in these dynamical systems. For simplified models, this region can be determined analytically. In the following section we will find evidence that such phase transitions are of great interest for the computational properties of dynamical systems.
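The qualitative difference between K ≤ 2 and K ≥ 3 can be checked directly by iterating the map (6.2) numerically (a minimal sketch; the starting distance 0.01 and the iteration count are arbitrary choices). For K = 3 the nonzero fixed point can even be computed in closed form as y∗ = (3 − √5)/2 ≈ 0.382, by solving y = (1 − (1 − y)^3)/2 (a side calculation, not stated in the text):

```python
def derrida_map(y, K):
    """Annealed-approximation map (6.2): expected normalized state
    distance at the next time step."""
    return (1.0 - (1.0 - y) ** K) / 2.0

def iterate(y0, K, steps=10000):
    y = y0
    for _ in range(steps):
        y = derrida_map(y, K)
    return y

y_ordered = iterate(0.01, K=2)  # distances shrink toward the fixed point y = 0
y_chaotic = iterate(0.01, K=3)  # distances grow toward a nonzero fixed point
```

For K = 2 the iterates creep down the diagonal toward zero, whereas for K = 3 they settle on the nonzero fixed point, mirroring the Derrida plot in fig. 6.1.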

6.5 Computation at the Edge of Chaos in Cellular Automata

Evidence that systems exhibit superior computational properties near a phase transition came from the study of cellular automata. Cellular automata are quite similar to Boolean networks. The main differences are that connections between elements are local, and that an element may assume one of k possible states at each time step (instead of merely two states as in Boolean networks). The former difference implies that there is a notion of space in a cellular automaton. More precisely, a d-dimensional space is divided into cells (the elements of the network). The state of a cell at time t + 1 is a function only of its own state and the states of its immediate neighbors at time t. The latter difference is made explicit by defining a finite set Σ of cell states. The transition function Δ is a mapping from neighborhood states (including the cell itself) to the set of cell states. If the neighborhood is of size L, we have Δ : Σ^L → Σ. What do we mean by “computation” in the context of cellular automata? In one common meaning, the transition function is interpreted as the program, and the input is given by the initial state of the cellular automaton. The system then evolves for some specified number of time steps, or until some “goal pattern” (possibly a stable state) is reached. The final pattern is interpreted as the output of the automaton (Mitchell et al., 1993). In analogy to universal Turing machines, it has been shown that cellular automata are capable of universal computation (see, e.g., Codd, 1968; Smith, 1971; von Neumann, 1966). That is, there exist cellular automata which, given the algorithm to be applied as part of their initial configuration, can perform any computation that is computable by a Turing machine. In 1984, Wolfram conjectured that such powerful automata are located in a special dynamical regime.
Later, Langton identified this regime as lying at a phase transition between order and chaos (see below), i.e., as the regime which corresponds
Figure 6.2 Evolution of one-dimensional cellular automata. Each horizontal line represents one automaton state; successive time steps are shown as successive horizontal lines. Sites with value 1 are represented by black squares, sites with value 0 by white squares. One example each of an automaton of class 1 (left), class 4 (middle), and class 3 (right) is given.

to random Boolean networks with K = 2. Wolfram presented a qualitative characterization of one-dimensional cellular automaton behavior where the individual automata diﬀered by their transfer function.5 He found evidence that all one-dimensional cellular automata fall into four distinct classes (Wolfram, 1984). The dynamics for three of these classes are shown in ﬁg. 6.2. Class 1 automata evolve to a homogeneous state, i.e., a state where all cells are in the same state. Hence these systems evolve to a simple steady state. Class 2 automata evolve to a set of separated simple stable states or separated periodic structures of small length. These systems have short state cycles. Both of these classes operate in the ordered regime in the sense that state cycles are short. Class 3 automata evolve to chaotic patterns. Class 4 automata have long transients, and evolve “to complex localized structures” (Wolfram, 1984). Class 3 automata are operating in the chaotic regime. By chaotic, Wolfram refers to the unpredictability of the exact automaton state after a few time steps. Successor states look more or less random. He also talks about nonperiodic patterns. Of course these patterns are periodic if the automaton is of ﬁnite size. But in analogy with the results presented above, one can say that state cycles are very long. Transients are the states that emerge before the dynamics reaches a stable long-lasting behavior. They appear at the beginning of the state evolution. Once the system is on a state cycle, it will never revisit such transient states. The transients of class 4 automata can be identiﬁed with large basins of attraction or high stability of state cycles. Wolfram conjectured that class 4 automata are capable of universal computations. In 1990, Langton systematically studied the space of cellular automata considered by Wolfram with respect to an order parameter λ (Langton, 1990). This
parameter λ determines a crucial property of the transfer function Δ: the fraction of entries in Δ which do not map to some prespeciﬁed quiescent state sq . Hence, for λ = 0, all local conﬁgurations map to sq , and the automaton state moves to a homogeneous state after one time step for every initial condition. More generally, low λ values lead to ordered behavior. Rules with large λ tend to produce a completely diﬀerent behavior. Langton (1990) stated the following question: “Under what conditions will physical systems support the basic operations of information transmission, storage, and modiﬁcation constituting the capacity to support computation?” When Langton went through diﬀerent λ values in his simulations, he found that all automaton classes of Wolfram appeared in this parameterization. Moreover, he found that the interesting class 4 automata can be found at the phase transition between ordered and chaotic behavior for λ values between about 0.45 and 0.5, values of intermediate heterogeneity. Information-theoretic analysis supported the conjectures of Wolfram, indicating that the edge of chaos is the dominant region of computationally powerful systems. Further evidence for Wolfram’s hypothesis came from Packard (1988). Packard used genetic algorithms to genetically evolve one-dimensional cellular automata for a simple computational task. The goal was to develop in this way cellular automata which behave as follows: The state of the automaton should converge to the all-one state (i.e., the state where every cell is in state 1), if the fraction of one-states in the initial conﬁguration is larger than 0.5. If the fraction of one-states in the initial conﬁguration is below 0.5, it should evolve to the all-zero state. Mutations were accomplished by changes in the transfer function (point mutations which changed only a single entry in the rule table, and crossover which merged two rule tables into a single one). 
After applying a standard genetic algorithm procedure to an initial set of cellular automaton rules, he examined the rule tables of the genetically evolved automata. The majority of the evolved rule tables had λ values either around 0.23 or around 0.83. These are the two λ values at which the transition from order to chaos appears for cellular automata with two states per cell.6 “Thus, the population appears to evolve toward that part of the space of rules that marks the transition to chaos” (Packard, 1988). These results were later criticized (Mitchell et al., 1993). Mitchell and collaborators reexamined the ideas of Packard and performed similar simulations with a genetic algorithm. The results of these investigations differed from Packard’s results: the density of evolved automata was symmetrically peaked around λ = 0.5, much closer to 0.5 than Packard’s values and definitely not in the transition region. They argued that the optimal λ value should strongly depend on the task. Specifically, in the task considered by Packard one would expect a λ value close to 0.5 for a well-performing rule, because the task is symmetric with respect to the exchange of ones and zeros. A rule with λ < 0.5 tends to decrease the number of ones in the state vector, because more entries in the rule table map the state to zero. This can lead to errors if the fraction of ones in the initial state is only slightly larger than 0.5. Indeed, a rule which performs very well on this task, the Gacs-
Kurdyumov-Levin (GKL) rule, has λ = 0.5. It was suggested that artifacts in the genetic algorithm could account for the different results. We want to return here to the notion of computation. Wolfram and Langton were interested in universal computations. Although universality results for automata are mathematically interesting, they do not contribute much to the goal of understanding computations in biological neural systems. Biological organisms usually face computational tasks which are quite different from the off-line computations on discrete batch inputs for which Turing machines are designed. Packard was interested in automata which perform a specific kind of computation, with the transition function being the “program.” Mitchell et al. showed that there are complex tasks for which the best systems are not located at the edge of chaos. In Mitchell et al. (1993), a third meaning of computation in cellular automata is mentioned, a kind of “intrinsic” computation: “Here, computation is not interpreted as the performance of a ‘useful’ transformation of the input to produce the output. Rather, it is measured in terms of generic, structural computational elements such as memory, information production, information transfer, logical operations, and so on. It is important to emphasize that the measurement of such intrinsic computational elements does not rely on a semantics of utility as do the preceding computational types” (Mitchell et al., 1993). It is worthwhile to note that this “intrinsic” computation in dynamical systems can be used by a readout unit which maps system states to desired outputs. This is the basic idea of the liquid state machine and of echo state networks, and it is the basis of the considerations in the following sections. To summarize, systems at the edge of chaos are believed to be computationally powerful. However, the types of computation considered so far are considerably different from computations in organisms.
In the following section, we will consider a model of computation better suited for our purposes.
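The GKL rule itself is compact enough to state in code. The sketch below implements the form commonly cited in the literature (a cell in state 0 takes the majority vote of itself and its neighbors at distances 1 and 3 to its left; a cell in state 1 uses the corresponding cells to its right); the exact neighbor offsets are stated from the wider literature on this rule, not from this chapter, so treat them as an assumption. Enumerating all 2^7 radius-3 neighborhoods confirms the λ = 0.5 claim:

```python
def gkl_step(cells):
    """One synchronous step of the Gacs-Kurdyumov-Levin rule on a ring:
    a 0-cell takes the majority of itself and the cells at distance 1 and 3
    to its left; a 1-cell uses the cells at distance 1 and 3 to its right."""
    n = len(cells)
    out = []
    for i in range(n):
        if cells[i] == 0:
            votes = cells[i] + cells[(i - 1) % n] + cells[(i - 3) % n]
        else:
            votes = cells[i] + cells[(i + 1) % n] + cells[(i + 3) % n]
        out.append(1 if votes >= 2 else 0)
    return out

def gkl_lambda():
    """Langton's lambda for GKL: the fraction of the 2**7 radius-3
    neighborhood entries that do NOT map to the quiescent state 0."""
    ones = 0
    for bits in range(2 ** 7):
        nb = [(bits >> j) & 1 for j in range(7)]  # nb[3] is the center cell
        if nb[3] == 0:
            votes = nb[3] + nb[2] + nb[0]   # left neighbors, distance 1 and 3
        else:
            votes = nb[3] + nb[4] + nb[6]   # right neighbors, distance 1 and 3
        ones += votes >= 2
    return ones / 2 ** 7
```

The all-ones and all-zeros configurations are fixed points of the rule, which is exactly what the density classification task requires of its two answer states.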

6.6 The Edge of Chaos in Systems with Online Input Streams

All previously considered computations were off-line computations, where some initial state (the input) is transformed by the dynamics into a terminal state or state cycle (the output). However, computation in biological neural networks is quite different from computation in Turing machines or other traditional computational models. The input to an organism is a continuous stream of data, and the organism reacts in real time (i.e., within a given time interval) to information contained in this input. Hence, as opposed to batch processing, the input to a biological system is a time-varying signal which is mapped to a time-varying output signal. Such mappings are also called filters. In this section, we will have a look at recent work on real-time computations in threshold networks by Bertschinger and Natschläger (2004); see also Natschläger et al. (2005). Results of experiments with closely related hardware models are reported in Schuermann et al. (2005).
Threshold networks are special cases of Boolean networks consisting of N elements (units) with states xi ∈ {−1, 1}, i = 1, . . . , N. In networks with online input, the state of each element depends on the states of exactly K randomly chosen other units and, in addition, on an external input signal u(·) (the online input). At each time step, u(t) assumes the value ū + 1 with probability r and the value ū − 1 with probability 1 − r. Here, ū is a constant input bias. The transfer function of the elements is not an arbitrary Boolean function but a randomly chosen threshold function of the form

xi(t + 1) = Θ( Σ_{j=1}^{N} wij xj(t) + u(t + 1) ),    (6.3)

where wij ∈ R is the weight of the connection from element j to element i, and Θ(h) = +1 if h ≥ 0 and Θ(h) = −1 otherwise. For each element, exactly K of its incoming weights are nonzero; these are chosen from a Gaussian distribution with zero mean and variance σ². Different dynamical regimes of such circuits are shown in fig. 6.3. The top row shows the online input and, below it, typical activity patterns of networks with ordered, critical, and chaotic dynamics. The system parameters for each of these circuits are indicated in the phase plot below. The variance σ² of the nonzero weights was varied to achieve the different dynamics. The transition from the ordered to the chaotic regime is referred to as the critical line. Bertschinger and Natschläger used the approach of Derrida to determine the dynamical regime of these systems. They analyzed the change in Hamming distance between two (initial) states and their successor states, provided that the same input is applied in both situations. Using Derrida’s annealed approximation, one can calculate the Hamming distance d(t + 1) given the Hamming distance d(t) of the states at time t. If arbitrarily small distances tend to increase, the network operates in the chaotic phase. If arbitrarily small distances tend to decrease, the network operates in the ordered phase. This can also be expressed by the stability of the fixed point d∗ = 0: in the ordered phase, this fixed point is the only fixed point and it is stable; in the chaotic phase, another fixed point appears and d∗ = 0 becomes unstable. The fixed point 0 is stable if the absolute value of the slope of the map at d(t) = 0,

α = ∂d(t + 1)/∂d(t) evaluated at d(t) = 0,

is smaller than 1. Therefore, the transition from order to chaos (the critical line) is given by |α| = 1. This line can be characterized by the equation

r·PBF(ū + 1) + (1 − r)·PBF(ū − 1) = 1/K,    (6.4)

where the bit-flip probability PBF(v) is the probability that a single changed state component in the K inputs to a unit that receives the current online input v leads to a change of the output of that unit. This result has a nice interpretation. Consider
Figure 6.3 Threshold networks with online input streams in different dynamical regimes. The top row shows activity patterns for ordered (left), critical (middle), and chaotic (right) behavior. Each vertical line represents the activity in one time step: black (white) squares represent sites with value 1 (−1), and successive vertical lines represent successive circuit states. The input to the network is shown above the plots. The parameters σ² and ū of these networks are indicated in the phase plot below. Further parameters: number of input connections K = 4, number of elements N = 250.

a value of r = 1, i.e., a constant input to the network. Consider two network states C1, C2 which differ in only one state component. This differing component serves on average as input to K elements (because each element receives K inputs, so there are altogether N · K connections). If the bit-flip probability in each of these units is larger than 1/K, then on average more than one component will differ in the successor states of C1 and C2; hence, differences are amplified. If the bit-flip probability of each element is smaller than 1/K, the differences will die out on average.
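Both the update rule (6.3) and the bit-flip probability in (6.4) are straightforward to explore numerically. The sketch below simulates such a network and estimates PBF(v) by Monte Carlo (network size, input length, trial counts, and seeds are arbitrary illustrative choices):

```python
import numpy as np

def make_network(N, K, sigma2, seed=0):
    """Each unit has exactly K nonzero incoming weights, drawn from a
    zero-mean Gaussian with variance sigma2 (all other weights are zero)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for i in range(N):
        idx = rng.choice(N, size=K, replace=False)
        W[i, idx] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    return W

def run(W, u, x0):
    """Iterate x_i(t+1) = Theta(sum_j w_ij x_j(t) + u(t+1)), states in {-1, 1}."""
    x = x0.copy()
    states = []
    for t in range(len(u)):
        x = np.where(W @ x + u[t] >= 0, 1, -1)
        states.append(x)
    return np.array(states)

def estimate_p_bf(K, sigma2, v, trials=20000, seed=0):
    """Monte Carlo estimate of P_BF(v): the probability that flipping one
    of the K inputs of a single threshold unit flips its output."""
    rng = np.random.default_rng(seed)
    flips = 0
    for _ in range(trials):
        w = rng.normal(0.0, np.sqrt(sigma2), size=K)
        x = rng.choice([-1, 1], size=K).astype(float)
        y = x.copy()
        y[rng.integers(K)] *= -1  # minimal perturbation: flip one input
        flips += (w @ x + v >= 0) != (w @ y + v >= 0)
    return flips / trials

rng = np.random.default_rng(1)
N, K, r, u_bar = 50, 4, 0.5, 0.0
u = np.where(rng.random(200) < r, u_bar + 1.0, u_bar - 1.0)
states = run(make_network(N, K, sigma2=1.0), u, np.ones(N))
p_bf = estimate_p_bf(K, sigma2=1.0, v=0.0)
```

As a sanity check: for v = 0, K = 4, σ² = 1, a short side calculation (a flip occurs exactly when |w_j x_j| exceeds the magnitude of the remaining weighted sum) gives PBF = 1/3, which the estimate reproduces.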

6.7 Real-Time Computation in Dynamical Systems

In the previous section we were interested in the dynamical properties of systems with online input. The work we discussed there was influenced by recent ideas concerning computation in neural circuits, which we sketch in this section. The idea of using the rich dynamics of neural systems that can be observed in cortical circuits, rather than restricting them, resulted in the liquid state machine


model by Maass et al. (2002) and the echo state network by Jaeger (2002).7 They assume time series as inputs and outputs of the system. A recurrent network is used to hold nonlinearly transformed information about the past input stream in the state of the network. It is followed by a memoryless readout unit which simply looks at the current state of the circuit. The readout can then learn to map the current state of the system onto some target output. Superior performance of echo state networks in various engineering applications is suggested by the results of Jaeger and Haas (2004). The requirement that the network operate in the ordered phase is important in these models, although it is usually described with a different terminology. The ordered phase can be described by using the notion of fading memory (Boyd and Chua, 1985). Time-invariant fading memory filters are exactly those filters which can be represented by Volterra series. Informally speaking, a network has fading memory if its state at time t depends (up to some finite precision) only on the values (up to some finite precision) of its input from some finite time window [t − T, t] into the past (Maass et al., 2002). This is essentially equivalent to the requirement that if there are no longer any differences in the online inputs, then the state differences converge to 0; this is called the echo state property in Jaeger (2002). Besides the fading memory property, another property of the network is important for computations on time series: the pairwise separation property (Maass et al., 2002). Roughly speaking, a network has the pairwise separation property if for any two input time series which differed in the past, the network assumes different states at subsequent time points. Chaotic networks have such a separation property, but they do not have fading memory since differences in the initial state are amplified. On the other hand, very ordered systems have fading memory but provide weak separation.
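The echo state property is easy to check numerically. The following sketch uses a small tanh recurrent network; all sizes and constants are illustrative choices, not values from the models cited above. Scaling the recurrent weights to a spectral radius below 1 is the commonly used heuristic (not a guarantee) for obtaining fading memory:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
W = rng.normal(0.0, 1.0, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius to 0.9
w_in = rng.normal(0.0, 1.0, N)

def run(x, inputs):
    """Iterate x(t+1) = tanh(W x(t) + w_in u(t)) and return the final state."""
    for u in inputs:
        x = np.tanh(W @ x + w_in * u)
    return x

inputs = rng.uniform(-1.0, 1.0, 200)           # one shared online input stream
x_a = run(rng.normal(0.0, 1.0, N), inputs)     # two runs from different
x_b = run(rng.normal(0.0, 1.0, N), inputs)     # random initial states
print(np.linalg.norm(x_a - x_b))
```

The printed distance is tiny: after 200 steps of identical input, the two trajectories have converged and the initial conditions are forgotten, which is exactly the echo state property.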
Hence, the separation property and the fading memory property are antagonistic. Ideally, one would like to have high separation for salient differences in the input stream while still keeping the fading memory property (especially for variations in the input stream that do not contribute salient information). It is therefore of great interest to analyze these properties in models for neural circuits. A first step in this direction was made by Bertschinger and Natschläger (2004) in the context of threshold circuits. Similar to section 6.6, one can analyze the evolution of the state separation resulting from two input streams u1 and u2 which differ at time t with some probability. The authors defined the network-mediated separation (short: NM-separation) of a network. Informally speaking, the NM-separation is roughly the amount of state distance in a network which results from differences in the input stream, minus the amount of state difference resulting from different initial states. Hence, the NM-separation has a small value in the ordered regime, where both terms are small, but also in the chaotic regime, where both terms are large. Indeed, it was shown that the NM-separation peaks at the critical line, as shown in fig. 6.4a. Hence, Bertschinger and Natschläger (2004) offer a new interpretation of the critical line and provide a more direct link between the edge of chaos and computational power.


Figure 6.4  The network-mediated separation and computational performance for a 3-bit parity task with different settings of the parameters σ2 and ū. (a) The NM-separation peaks at the critical line. (b) High performance is achieved near the critical line. The performance is measured in terms of the memory capacity MC (Jaeger, 2002). The memory capacity is defined as the mutual information MI between the network output and the target function, summed over all delays τ ≥ 0 on a test set. More formally, MC = Σ_{τ=0}^{∞} MI(vτ, yτ), where vτ(·) denotes the network output and yτ(t) = PARITY(u(t − τ), u(t − τ − 1), u(t − τ − 2)) is the target output.

Since the separation property is important for the computational properties of the network, one would expect that the computational performance peaks near the critical line. This was confirmed in simulations where the computational task was to compute the delayed 3-bit parity8 of the input signal. The readout neuron was implemented by a simple linear classifier C(x(t)) = Θ(w · x(t) + w0) which was trained with linear regression. Note that the parity task is quite complex since it partitions the set of all inputs into two classes which are not linearly separable (and can therefore not be represented by the linear readout alone), and it requires memory. Figure 6.4b shows that the highest performance is achieved for parameter values close to the critical line, although it is not clear why the performance drops for increasing values of ū. In contrast to preceding work (Langton, 1990; Packard, 1988), the networks used were not optimized for a specific task. Only the linear readout was trained to extract the specific information from the state of the system. This is important since it decouples the dynamics of the network from a specific task.
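A minimal version of this experiment can be sketched as follows; the network size, the value of σ, the amount of training data, and the restriction to delay τ = 0 are all illustrative choices made here. A random threshold network is driven by a random ±1 input stream, its states are recorded, and a linear readout is fit by least squares to the 3-bit parity of the three most recent input bits:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, sigma = 400, 4, 1.0

# random threshold network as in section 6.6
W = np.zeros((N, N))
for i in range(N):
    W[i, rng.choice(N, K, replace=False)] = rng.normal(0.0, sigma, K)

T = 4000
u = rng.choice([-1.0, 1.0], size=T)              # random +/-1 input stream
x = np.ones(N)
X = np.empty((T, N))
for t in range(T):
    x = np.where(W @ x + u[t] >= 0.0, 1.0, -1.0)
    X[t] = x

# target: 3-bit parity of the current and the two preceding input bits;
# for +/-1 bits the product plays the role of XOR
y = u * np.roll(u, 1) * np.roll(u, 2)
X, y = X[10:], y[10:]                            # drop transient and wrap-around
half = len(y) // 2
Xb = np.hstack([X, np.ones((len(X), 1))])        # constant bias feature
w, *_ = np.linalg.lstsq(Xb[:half], y[:half], rcond=None)   # linear regression
acc = np.mean(np.sign(Xb[half:] @ w) == y[half:])
print(acc)
```

Note that the readout alone could never represent parity; the accuracy above the 0.5 chance level comes entirely from the nonlinear mixing of recent inputs in the network state.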

6.8  Self-Organized Criticality

Are there systems in nature with dynamics located at the edge of chaos? Since the edge of chaos is a small boundary region in the space of possible dynamics, only a vanishingly small fraction of systems should operate in this dynamical regime. However, it was argued that such “critical” systems are abundant in nature (see,


e.g., Bak et al., 1988). How is this possible if critical dynamics may occur only accidentally in nature? Bak and collaborators argue that a class of dissipative coupled systems naturally evolves toward critical dynamics (Bak et al., 1988). This phenomenon was termed self-organized criticality (SOC), and it was demonstrated with a model of a sand pile. Imagine building up a sand pile by randomly adding sand to the pile, one grain at a time. As sand is added, the slope will increase. Eventually, the slope will reach a critical value. Whenever the local slope of the pile is too steep, sand will slide off, thereby reducing the slope locally. On the other hand, if one starts with a very steep pile it will collapse and reach the critical slope from the other direction. In neural systems, the topology of the network and the synaptic weights strongly influence the dynamics. Since the amount of genetically determined connections between neurons is limited, self-organizing processes during brain development as well as learning processes are assumed to play a key role in regulating the dynamics of biological neural networks (Bornholdt and Röhl, 2003). Although the dynamics is a global property of the network, biologically plausible learning rules try to estimate the global dynamics from information available at the local synaptic level, and they only change local parameters. Several SOC rules have been suggested (Bornholdt and Röhl, 2000, 2003; Christensen et al., 1998; Natschläger et al., 2005). In Bornholdt and Röhl (2003), the degree of connectivity was regulated in a locally connected network (i.e., only neighboring neurons are connected) with stochastic state update dynamics. A local rewiring rule was used which is related to Hebbian learning. The main idea of this rule is that the average correlation between the activities of two neurons contains information about the global dynamics. This rule relies only on information available at the local synaptic level.
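The sand pile picture corresponds to the Bak–Tang–Wiesenfeld sandpile model, which can be sketched in a few lines; the lattice size and the number of added grains are arbitrary illustrative choices:

```python
import numpy as np

def topple(grid):
    """Relax the pile: any site with >= 4 grains topples, sending one grain
    to each of its 4 neighbors; grains falling off the edge are lost."""
    while True:
        unstable = np.argwhere(grid >= 4)
        if len(unstable) == 0:
            return grid
        for i, j in unstable:
            grid[i, j] -= 4
            for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if 0 <= ni < grid.shape[0] and 0 <= nj < grid.shape[1]:
                    grid[ni, nj] += 1

rng = np.random.default_rng(0)
L = 15
grid = np.zeros((L, L), dtype=int)
for _ in range(8000):             # slow driving: one grain at a time
    i, j = rng.integers(0, L, size=2)
    grid[i, j] += 1
    topple(grid)
print(grid.mean())
```

Without any tuning of parameters, the pile organizes itself into a stationary state with every site below the toppling threshold and a mean height around 2 grains per site, in which added grains trigger avalanches of widely varying sizes.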
Self-organized criticality in systems with online input streams (as discussed in section 6.6) was considered in Natschläger et al. (2005). According to section 6.6, the dynamics of a threshold network is at the critical line if the bit-flip probability PBF (averaged over the external and internal input statistics) is equal to 1/K, where K is the number of inputs to a unit. The idea is to estimate the bit-flip probability of a unit by the mean distance of the internal activation of that unit from the firing threshold. This distance is called the margin. Intuitively, a node with an activation much higher or lower than its firing threshold is rather unlikely to change its output if a single bit in its inputs is flipped. Each node i then applies synaptic scaling to its weights wij in order to adjust itself toward the critical line:

    wij(t + 1) = 1/(1 + ν) · wij(t)   if PBF_i^est(t) > 1/K,
    wij(t + 1) = (1 + ν) · wij(t)     if PBF_i^est(t) < 1/K,    (6.5)

where 0 < ν ≪ 1 is the learning rate and PBF_i^est(t) is an estimate of the bit-flip probability PBF_i of unit i. It was shown by simulations that this rule keeps the dynamics in the critical regime, even if the input statistics change. The computational capabilities of randomly chosen circuits with this synaptic scaling rule acting online during computation were tested in a setup similar to that discussed in section 6.7.
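A toy implementation of this rule is sketched below. Instead of the margin-based estimator of Natschläger et al. (2005), it estimates each unit's bit-flip probability directly, by testing at every step whether flipping each of its K input bits would change its output (an exact but more expensive stand-in); the starting weights, learning rate, and run length are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, nu = 200, 4, 0.01
idx = np.stack([rng.choice(N, K, replace=False) for _ in range(N)])  # K inputs per unit
Wk = rng.normal(0.0, 5.0, (N, K))      # start deep in the chaotic regime
x = rng.choice([-1.0, 1.0], N)
est = np.full(N, 0.5)                  # running bit-flip probability estimates

for t in range(2000):
    u = rng.choice([-1.0, 1.0])
    xi = x[idx]                                    # the K inputs seen by each unit
    h = (Wk * xi).sum(axis=1) + u                  # internal activation
    # would flipping one input bit flip the output?  (stand-in for the
    # margin-based estimate used in the original rule)
    flips = np.sign(h[:, None] - 2.0 * Wk * xi) != np.sign(h)[:, None]
    est = 0.99 * est + 0.01 * flips.mean(axis=1)   # running average
    # synaptic scaling toward the critical line, eq. (6.5)
    scale = np.where(est > 1.0 / K, 1.0 / (1.0 + nu), 1.0 + nu)
    Wk *= scale[:, None]
    x = np.where(h >= 0.0, 1.0, -1.0)

print(est.mean())
```

Starting from strongly chaotic weights, the estimated bit-flip probabilities settle near the critical value 1/K = 0.25: the purely local multiplicative rule regulates the global dynamics.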


The performance of these networks was as high as for circuits where the parameters were a priori chosen in the critical regime, and they stayed in this region. This shows that systems can perform speciﬁc computations while still being able to react to changing input statistics in a ﬂexible way.

6.9  Toward the Analysis of Biological Neural Systems

Do cortical microcircuits operate at the edge of chaos? If biology makes extensive use of the rich internal dynamics of cortical circuits, then the previous considerations would suggest this idea. However, the neural elements in the brain are quite different from the elements discussed so far. Most important, biological neurons communicate with spikes, discrete events in continuous time. In this section, we will investigate the dynamics of spiking circuits and ask: In what dynamical regimes are neural microcircuits computationally powerful? We propose in this section a conceptual framework and new quantitative measures for the investigation of this question (see also Maass et al., 2005). In order to make this approach feasible, in spite of numerous unknowns regarding synaptic plasticity and the distribution of electrical and biochemical signals impinging on a cortical microcircuit, we make in the present first step of this approach the following simplifying assumptions:

1. Particular neurons (“readout neurons”) learn via synaptic plasticity to extract specific information encoded in the spiking activity of neurons in the circuit.

2. We assume that the cortical microcircuit itself is highly recurrent, but that the impact of feedback that a readout neuron might send back into this circuit can be neglected.9

3. We assume that synaptic plasticity of readout neurons enables them to learn arbitrary linear transformations. More precisely, we assume that the input to such readout neurons can be approximated by a term Σ_{i=1}^{n−1} wi xi(t), where n − 1 is the number of presynaptic neurons, xi(t) results from the output spike train of the ith presynaptic neuron by filtering it according to the low-pass filtering property of the membrane of the readout neuron,10 and wi is the efficacy of the synaptic connection.
Thus wi xi (t) models the time course of the contribution of previous spikes from the ith presynaptic neuron to the membrane potential at the soma of this readout neuron. We will refer to the vector x(t) as the “circuit state at time t” (although it is really only that part of the circuit state which is directly observable by readout neurons). All microcircuit models that we consider are based on biological data for generic cortical microcircuits (as described in section 6.9.1), but have diﬀerent settings of their parameters.
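The filtered quantity xi(t) in assumption 3 can be illustrated with a simple exponential filter; the time constant, time resolution, and firing rate below are assumed values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.001     # 1 ms time steps
tau = 0.015    # assumed membrane time constant of 15 ms

# a 200 ms Poisson spike train at 20 Hz, one 0/1 entry per time step
spikes = (rng.random(200) < 20.0 * dt).astype(float)

# x_i(t): the spike train filtered by the low-pass (exponential) response
# of the readout neuron's membrane
x = np.zeros(len(spikes))
for t in range(1, len(spikes)):
    x[t] = x[t - 1] * np.exp(-dt / tau) + spikes[t]
print(x.max())
```

Each spike adds a jump that then decays with the membrane time constant, so x(t) is a smooth, real-valued trace of the recent spiking history, which is what makes a weighted sum of such traces a sensible model of the readout's membrane potential.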


6.9.1  Models for Generic Cortical Microcircuits

Our empirical studies were performed on a large variety of models for generic cortical microcircuits (we refer to Maass et al., 2004, for more detailed definitions and explanations). All circuit models consisted of leaky integrate-and-fire neurons11 and biologically quite realistic models for dynamic synapses.12 Neurons (20% of which were randomly chosen to be inhibitory) were located on the grid points of a 3D grid of dimensions 6×6×15 with edges of unit length. The probability of a synaptic connection from neuron a to neuron b was proportional to exp(−D2(a, b)/λ2), where D(a, b) is the Euclidean distance between a and b, and λ is a spatial connectivity constant (not to be confused with the λ parameter used by Langton). Synaptic efficacies w were chosen randomly from distributions that reflect biological data (as in Maass et al., 2002), with a common scaling factor Wscale.
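The distance-dependent connection rule can be sketched as follows; the base probability C (which plays the role of the unspecified proportionality constant) and the exclusion of self-connections are assumptions made here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0      # the spatial connectivity constant lambda
C = 0.3        # assumed base probability (the proportionality constant)

# neurons on the grid points of a 6 x 6 x 15 grid with unit edge length
coords = np.array([(x, y, z) for x in range(6) for y in range(6) for z in range(15)])
n = len(coords)                                                # 540 neurons
D2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)  # squared distances
p = C * np.exp(-D2 / lam**2)                                   # P(connection a -> b)
np.fill_diagonal(p, 0.0)           # no self-connections (a simplification made here)
A = rng.random((n, n)) < p         # sample the directed adjacency matrix
print(A.sum() / n)                 # average number of outgoing connections
```

Increasing λ makes long-range connections more likely and therefore raises both the number and the average distance of synaptically connected neuron pairs, which is exactly the role λ plays in the parameter maps below.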

Figure 6.5  Performance of different types of neural microcircuit models for classification of spike patterns. (a) In the top row are two examples of the 80 spike patterns that were used (each consisting of 4 Poisson spike trains at 20 Hz over 200 ms), and in the bottom row are examples of noisy variations (Gaussian jitter with SD 10 ms) of these spike patterns which were used as circuit inputs. (b) Fraction of examples (out of 200 test examples) that were correctly classified by a linear readout (trained by linear regression with 500 training examples). Results are shown for 90 different types of neural microcircuits C with λ varying on the x-axis and Wscale on the y-axis (20 randomly drawn circuits and 20 target classification functions randomly drawn from the set of 2^80 possible classification functions were tested for each of the 90 different circuit types, and the resulting correctness rates were averaged). Circles mark three specific choices of λ, Wscale pairs for comparison with other figures; see fig. 6.6. The standard deviation of the result is shown in the inset on the upper right.

Linear readouts from circuits with n − 1 neurons were assumed to compute a weighted sum Σ_{i=1}^{n−1} wi xi(t) + w0 (see section 6.9). In order to simplify notation we assume that the vector x(t) contains an additional constant component x0(t) = 1, so that one can write w · x(t) instead of Σ_{i=1}^{n−1} wi xi(t) + w0. In the case of classification tasks we assume that the readout outputs 1 if w · x(t) ≥ 0, and 0 otherwise.


In order to investigate the influence of synaptic connectivity on computational performance, neural microcircuits were drawn from this distribution for 10 different values of λ (which scales the number and average distance of synaptically connected neurons) and 9 different values of Wscale (which scales the efficacy of all synaptic connections). Twenty microcircuit models C were drawn for each of these 90 different assignments of values to λ and Wscale. For each circuit a linear readout was trained to perform one (randomly chosen) out of the 2^80 possible classification tasks on noisy variations u of 80 fixed spike patterns as circuit inputs. Each spike pattern u consisted of four Poisson spike trains over 200 ms; see fig. 6.5 for two examples of such spike patterns. The target performance of any such circuit was to output at time t = 200 ms the class (0 or 1) of the spike pattern from which the preceding circuit input had been generated (for some arbitrary partition of the 80 fixed spike patterns into two classes). Performance results are shown in fig. 6.5b for the 90 different types of neural microcircuit models.

6.9.2  Locating the Edge of Chaos in Neural Microcircuit Models

It turns out that the previously considered characterizations of the edge of chaos are not too successful in identifying those parameter values in the map of fig. 6.5b that yield circuits with large computational power (Maass et al., 2005). The reason is that large initial state differences (as they are typically caused by different spike input patterns) tend to yield, for most values of the circuit parameters, nonzero state differences not only while the online spike inputs are different, but also long afterward when the online inputs agree during subsequent seconds (even if the random internal noise is identical in both trials). But if one applies the definition of the edge of chaos via Lyapunov exponents (see Kantz and Schreiber, 1997), the resulting edge of chaos lies, for the previously introduced type of computations (classification of noisy spike templates by a trained linear readout), in the region of the best computational performance (see the map in fig. 6.5b, which is repeated for easier comparison in fig. 6.6d). For this definition one looks for the exponent μ ∈ R that provides, through the formula δ_ΔT ≈ δ0 · e^(μΔT), the best estimate of the state separation δ_ΔT at time ΔT after the computation was started in two trials with an initial state difference δ0. We generalize this analysis to the case with online input by choosing exactly the same online input (and the same random noise) during the intervening time interval of length ΔT, and by averaging the resulting state differences δ_ΔT over many random choices of such online inputs (and internal noise). As in the classical case with off-line input it turns out to be essential to apply this estimate for δ0 → 0, since δ_ΔT tends to saturate for each fixed value δ0. This can be seen in fig. 6.6a, which shows results of this experiment for a δ0 that results from moving a single spike that occurs in the online input at time t = 1 s by 0.5 ms. This experiment was repeated for three

Figure 6.6  Analysis of small input differences for different types of neural microcircuit models as specified in section 6.9.1. Each circuit C was tested for two arrays u and v of 4 input spike trains at 20 Hz over 10 s that differed only in the timing of a single spike at time t = 1 s. (a) A spike at time t = 1 s was delayed by 0.5 ms. Temporal evolution of Euclidean differences between the resulting circuit states xu(t) and xv(t) for 3 different values of λ, Wscale according to the three points marked in panel c. For each parameter pair, the average state difference of 40 randomly drawn circuits is plotted. (b) Lyapunov exponents μ along a straight line between the points marked in panel c, for different delays of the delayed spike. The delay is denoted on the right of each line. The exponents were determined from the average state difference of 40 randomly drawn circuits. (c) Lyapunov exponents μ for 90 different types of neural microcircuits C with λ varying on the x-axis and Wscale on the y-axis (the exponents were determined from the average state difference of 20 randomly drawn circuits for each parameter pair). A spike in u at time t = 1 s was delayed by 0.5 ms. The contour lines indicate where μ crosses the values −1, 0, and 1. (d) Computational performance of these circuits (same as fig. 6.5b), shown for comparison with panel c.

diﬀerent circuits with parameters chosen from the 3 locations marked on the map in ﬁg. 6.6c. By determining the best-ﬁtting μ for ΔT = 1.5s for three diﬀerent values of δ0 (resulting from moving a spike at time t = 1s by 0.5, 1, 2 ms) one gets the dependence of this Lyapunov exponent on the circuit parameter λ shown in ﬁg. 6.6b (for values of λ and Wscale on a straight line between the points marked in the map of ﬁg. 6.6c). The middle curve in ﬁg. 6.6c shows for which values of λ and Wscale the Lyapunov exponent is estimated to have the value 0. By comparing it with those regions on this parameter map where the circuits have the largest computational power (for the classiﬁcation of noisy spike patterns, see ﬁg. 6.6d), one sees that this line runs through those regions which yield the largest computational power for these computations. We refer to Mayor and Gerstner (2005) for other recent work


on studies of the relationship between the edge of chaos and the computational power of spiking neural circuit models. Although this estimated edge of chaos coincides quite well with points of best computational performance, it remains an unsatisfactory tool for predicting parameter regions with large computational power, for three reasons:

1. Since the edge of chaos is a lower-dimensional manifold in a parameter map (in this case a curve in a 2D map), it cannot predict the (full-dimensional) regions of a parameter map with high computational performance (e.g., the regions with light shading in fig. 6.5b).

2. The edge of chaos does not provide intrinsic reasons why points of the parameter map yield small or large computational power.

3. It turns out that in some parameter maps different regions provide circuits with large computational power for different classes of computational tasks (as shown in Maass et al., 2005, for computations on spike patterns and for computations with firing rates). But the edge of chaos can at best single out peaks for one of these regions. Hence it cannot possibly be used as a universal predictor of maximal computational power for all types of computational tasks.

These three deficiencies suggest that one has to think about different strategies to approach the central question of this chapter. The strategy we will pursue in the following is based on the assumption that the computational function of cortical microcircuits is not fully genetically encoded, but rather emerges through various forms of plasticity (“learning”) in response to the actual distribution of signals that the neural microcircuit receives from its environment. From this perspective the question about the computational function of cortical microcircuits C turns into the following questions: What functions (i.e., maps from circuit inputs to circuit outputs) can the circuit C learn to compute?
How well can the circuit C generalize a specific learned computational function to new inputs? In the following, we propose quantitative criteria based on rigorous mathematical principles for evaluating a neural microcircuit C with regard to these two questions. In section 6.9.5 we will compare the predictions of these quantitative measures with the actual computational performance achieved by neural microcircuit models as discussed in section 6.9.1.

6.9.3  A Measure for the Kernel-Quality

One expects from a powerful computational system that signiﬁcantly diﬀerent input streams cause signiﬁcantly diﬀerent internal states and hence may lead to diﬀerent outputs. Most real-world computational tasks require that the circuit give a desired output not just for two, but for a fairly large number m of signiﬁcantly diﬀerent


inputs. One could of course test whether a circuit C can separate each of the m(m − 1)/2 pairs of such inputs. But even if the circuit can do this, we do not know whether a neural readout from such a circuit would be able to produce given target outputs for these m inputs. Therefore we propose here the linear separation property as a more suitable quantitative measure for evaluating the computational power of a neural microcircuit (or, more precisely, the kernel quality of a circuit; see below). To evaluate the linear separation property of a circuit C for m different inputs u1, ..., um (which are in the following always functions of time, i.e., input streams such as, for example, multiple spike trains) we compute the rank of the n × m matrix M whose columns are the circuit states xui(t0) that result at some fixed time t0 for the preceding input stream ui. If this matrix has rank m, then it is guaranteed that any given assignment of target outputs yi ∈ R at time t0 for the inputs ui can be implemented by this circuit C (in combination with a linear readout). In particular, each of the 2^m possible binary classifications of these m inputs can then be carried out by a linear readout from this fixed circuit C. Obviously such insight is much more informative than a demonstration that some particular classification task can be carried out by such a circuit C. If the rank of this matrix M has a value r < m, then this value r can still be viewed as a measure for the computational power of this circuit C, since r is the number of “degrees of freedom” that a linear readout has in assigning target outputs yi to these inputs ui (in a way which can be made mathematically precise with concepts of linear algebra).
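This rank measure is straightforward to compute. The sketch below applies it to the threshold networks of section 6.6 rather than to a spiking microcircuit, purely to keep the example short; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, sigma = 100, 4, 1.0
W = np.zeros((N, N))
for i in range(N):
    W[i, rng.choice(N, K, replace=False)] = rng.normal(0.0, sigma, K)

def state_after(stream, W):
    """Circuit state x_u(t0) after feeding in the whole input stream u."""
    x = np.ones(len(W))
    for u in stream:
        x = np.where(W @ x + u >= 0.0, 1.0, -1.0)
    return x

m = 30
streams = [rng.choice([-1.0, 1.0], 20) for _ in range(m)]   # m different input streams
M = np.stack([state_after(s, W) for s in streams], axis=1)  # the N x m state matrix
r = np.linalg.matrix_rank(M)
print(r)
```

If the printed rank equals m, every one of the 2^m binary classifications of these m input streams is realizable by some linear readout from this fixed circuit; a smaller rank quantifies the remaining degrees of freedom.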
Note that this rank measure for the linear separation property of a circuit C may be viewed as an empirical measure for its kernel quality, i.e., for the complexity and diversity of the nonlinear operations carried out by C on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (see Vapnik, 1998).

6.9.4  A Measure for the Generalization-Capability

Obviously the preceding measure addresses only one component of the computational performance of a neural circuit C. Another component is its capability to generalize a learned computational function to new inputs. Mathematical criteria for generalization capability are derived by Vapnik (1998) (see ch. 4 in Cherkassky and Mulier, 1998, for a compact account of the results relevant for our arguments). According to this mathematical theory one can quantify the generalization capability of any learning device in terms of the VC-dimension of the class H of hypotheses that are potentially used by that learning device.13 More precisely: if VC-dimension(H) is substantially smaller than the size of the training set Strain, one can prove that this learning device generalizes well, in the sense that the hypothesis (or input-output map) produced by this learning device is likely to have for new examples an error rate which is not much higher than its error rate on Strain, provided that the new examples are drawn from the same distribution as the training examples (see eq. 4.22 in Cherkassky and Mulier, 1998). We apply this mathematical framework to the class HC of all maps from a set


Suniv of inputs u into {0, 1} that can be implemented by a circuit C. More precisely: HC consists of all maps from Suniv into {0, 1} that could possibly be implemented by a linear readout from circuit C with fixed internal parameters (weights etc.) but arbitrary weights w ∈ Rn of the readout (which classifies the circuit input u as belonging to class 1 if w · xu(t0) ≥ 0, and to class 0 if w · xu(t0) < 0). Whereas it is very difficult to achieve tight theoretical bounds for the VC-dimension of even much simpler neural circuits (see Bartlett and Maass, 2003), one can efficiently estimate the VC-dimension of the class HC that arises in our context for some finite ensemble Suniv of inputs (that contains all examples used for training or testing) by using the following mathematical result (which can be proved with the help of Radon's theorem):

Theorem 6.1  Let r be the rank of the n × s matrix consisting of the s vectors xu(t0) for all inputs u in Suniv (we assume that Suniv is finite and contains s inputs). Then r ≤ VC-dimension(HC) ≤ r + 1.

Proof Idea.  Fix some inputs u1, ..., ur in Suniv so that the resulting r circuit states xui(t0) are linearly independent. The first inequality is obvious since this set of r linearly independent vectors can be shattered by linear readouts from the circuit C. To prove the second inequality one assumes for a contradiction that there exists a set v1, ..., vr+2 of r + 2 inputs in Suniv so that the corresponding set of r + 2 circuit states xvi(t0) can be shattered by linear readouts. This set M of r + 2 vectors is contained in the r-dimensional space spanned by the linearly independent vectors xu1(t0), ..., xur(t0). Therefore Radon's theorem implies that M can be partitioned into disjoint subsets M1, M2 whose convex hulls intersect. Since these sets M1, M2 cannot be separated by a hyperplane, it is clear that no linear readout exists that assigns value 1 to points in M1 and value 0 to points in M2.
Hence M = M1 ∪ M2 is not shattered by linear readouts, a contradiction to our assumption. We propose to use the rank r defined in theorem 6.1 as an estimate of VC-dimension(HC), and hence as a measure that informs us about the generalization capability of a neural microcircuit C. It is assumed here that the set Suniv contains many noisy variations of the same input signal, since otherwise learning with a randomly drawn training set Strain ⊆ Suniv has no chance to generalize to new noisy variations. Note that each family of computational tasks induces a particular notion of what aspects of the input are viewed as noise, and what input features are viewed as signals that carry information relevant for the target output of at least one of these computational tasks. For example, for computations on spike patterns some small jitter in the spike timing is viewed as noise. For computations on firing rates even the sequence of interspike intervals and the temporal relations between spikes that arrive from different input sources are viewed as noise, as long as these input spike trains represent the same firing rates. An example of the former type of computational task was discussed in section 6.9.1. This task was to output at time t = 200 ms the class (0 or 1) of the spike pattern

Figure 6.7  Measuring the generalization capability of neural microcircuit models. (a) Test error minus training error (error was measured as the fraction of examples that were misclassified) in the spike pattern classification task discussed in section 6.9.1, for 90 different types of neural microcircuits (as in fig. 6.5b). The standard deviation is shown in the inset on the upper right. (b) Generalization capability for spike patterns: estimated VC-dimension of HC (for a set Suniv of inputs u consisting of 500 jittered versions of 4 spike patterns), for 90 different circuit types (average over 20 circuits; for each circuit, the average over 5 different sets of spike patterns was used). The standard deviation is shown in the inset on the upper right. See section 6.9.5 for details.

from which the preceding circuit input had been generated (for some arbitrary partition of the 80 fixed spike patterns into two classes; see section 6.9.1). For a poorly generalizing network, the difference between training and test error is large. One would expect this difference to become large as the network dynamics becomes more and more chaotic. This is indeed the case; see fig. 6.7a. The transition is well predicted by the estimated VC-dimension of HC; see fig. 6.7b.

6.9.5  Evaluating the Influence of Synaptic Connectivity on Computational Performance

We now test the predictive quality of the two proposed measures for the computational power of a microcircuit on spike patterns. One should keep in mind that the proposed measures do not attempt to test the computational capability of a circuit for one particular computational task, but rather for any distribution on Suniv and for a very large (in general, infinitely large) family of computational tasks that have in common only a particular bias regarding which aspects of the incoming spike trains may carry information that is relevant for the target output of computations, and which aspects should be viewed as noise. Figure 6.8a explains why the lower left part of the parameter map in fig. 6.5b is less suitable for any such computation, since there the kernel quality of the circuits is too low.14 Figure 6.8b explains why the upper right part of the parameter map in fig. 6.5b is less suitable, since a higher VC-dimension (for a training set of fixed size) entails poorer generalization capability. We are not aware of a theoretically founded way of combining both measures into a single value that predicts overall computational

152

What Makes a Dynamical System Computationally Powerful?

a

b

8

scale

8

450

4

450

4

2

400

2

400

2

350

1 0.7 0.5 0.3

350

1 0.7 0.5 0.3

1 0.7 0.5 0.3

W

c

8

4

300

300 250

250 0.1 0.05 0.5

200 1 1.4 2 λ

3 4

6 8

200

0.1 0.05 0.5

1 1.4 2 λ

3 4

6 8

10

20

5

3

15

2

10

SD

0

1 5

0.1 0.05 0.5

0 1 1.4 2 λ

3 4

6 8

Values of the proposed measures for computations on spike patterns. (a) Kernel quality for spike patterns of 90 diﬀerent circuit types (average over 20 circuits, mean SD = 13). (b) Generalization capability for spike patterns: estimated VC-dimension of HC (for a set Suniv of inputs u consisting of 500 jittered versions of 4 spike patterns), for 90 diﬀerent circuit types (same as ﬁg. 6.7b). (c) Diﬀerence of both measures (the standard deviation is shown in the inset on the upper right). This should be compared with actual computational performance plotted in ﬁg. 6.5b.

Figure 6.8

performance. But if one just takes the diﬀerence of both measures (after scaling each linearly into a common range [0,1]), then the resulting number (see ﬁg. 6.8c) predicts quite well which types of neural microcircuit models perform well for the particular computational tasks considered in Figure 6.5b.15 Results of further tests of the predictive power of these measures are reported in Maass et al. (2005). These tests have been applied there to a completely diﬀerent parameter map, and to diverse classes of computational tasks.
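For concreteness, the kernel-quality measure (the rank of a matrix of recorded circuit states, as described in section 6.9.3 and note 14) can be sketched in a few lines. This is a hypothetical illustration: the state values and sizes are made up, and a pure-Python rank computation stands in for a numerical library.

```python
def matrix_rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in rows]
    rank = 0
    n_rows, n_cols = len(m), len(m[0])
    for col in range(n_cols):
        # choose the pivot row with the largest entry in this column
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]), default=None)
        if pivot is None or abs(m[pivot][col]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # eliminate entries below the pivot
        for r in range(rank + 1, n_rows):
            f = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= f * m[rank][c]
        rank += 1
    return rank

def kernel_quality(states):
    """Kernel quality of a circuit: rank of the (#inputs x #units) matrix
    whose rows are the circuit states recorded at the readout time."""
    return matrix_rank(states)

# Toy state matrix: 3 input streams, 3 recorded units; rows 1 and 3 coincide,
# so only 2 of the 3 input streams are separable by a linear readout.
states = [[0.2, 1.0, 0.0],
          [0.5, 0.1, 0.3],
          [0.2, 1.0, 0.0]]
```

A higher rank means more input streams can be distinguished by a linear readout applied to the circuit state.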

6.10 Conclusions

The need to understand computational properties of complex dynamical systems is becoming more urgent. New experimental methods provide substantial insight into the inherent dynamics of the computationally most powerful classes of dynamical systems that are known: neural systems and gene regulation networks of biological organisms. Recent experimental data show that simplistic models for computations in such systems are not adequate, and that new concepts and methods have to be developed in order to understand their computational function. This short review has shown that several old ideas regarding computations in dynamical systems take on new relevance in this context, once they are transposed into a more realistic conceptual framework that also allows us to analyze online computations on continuous input streams. Another new ingredient is the investigation of the temporal evolution of information in a dynamical system from the perspective of models for the (biological) user of such information, i.e., from the perspective of neurons that receive inputs from several thousand presynaptic neurons in a neural circuit, and from the perspective of gene regulation mechanisms that involve thousands of transcription factors. Empirical evidence from the area of machine learning supports the hypothesis that readouts of this type, which are able to sample not just two or three, but thousands of coordinates of the state vector of a dynamical system, impose different (and in general, less obvious) constraints on the dynamics of a high-dimensional dynamical system in order to employ such a system for complex computations on continuous input streams. One might conjecture that unsupervised learning and regulation processes in neural systems adapt the system dynamics in such a way that these constraints are met. Hence, suitable variations of the idea of self-organized criticality may help us to gain a system-level perspective of synaptic plasticity and other adaptive processes in neural systems.

Notes

1. For the sake of completeness, we give here the definition of an attractor according to Strogatz (1994). He defines an attractor to be a closed set A with the following properties: (1) A is an invariant set: any trajectory x(t) that starts in A stays in A for all time. (2) A attracts an open set of initial conditions: there is an open set U containing A such that if x(0) ∈ U, then the distance from x(t) to A tends to zero as t → ∞. The largest such U is called the basin of attraction of A. (3) A is minimal: there is no proper subset of A that satisfies conditions 1 and 2.

2. In Kauffman (1969), the inactive state of a variable is denoted by 0. We use −1 here for reasons of notational consistency.

3. Here, x_i potentially depends on all other variables x_1, . . . , x_N. The function f_i can always be restricted such that x_i is determined by the inputs to element i only.

4. In Kauffman (1993), a state cycle is also called an attractor. Because such state cycles can be unstable to most minimal perturbations, we avoid the term attractor here.

5. Wolfram considered automata with a neighborhood of five cells in total and two possible cell states. Since he considered "totalistic" transfer functions only (i.e., the function depends on the sum of the neighborhood states only), the number of possible transfer functions was small. Hence, the behavior of all such automata could be studied.

6. In the case of two-state cellular automata, high λ values imply that most state transitions map to the single nonquiescent state, which leads to ordered dynamics. The most heterogeneous rules are found at λ = 0.5.

7. The model in Maass et al. (2002) was introduced in the context of biologically inspired neural microcircuits. The network consisted of spiking neurons. In Jaeger (2002), the network consisted of sigmoidal neurons.

8. The delayed 3-bit parity of an input signal u(·) is given by PARITY(u(t − τ), u(t − τ − 1), u(t − τ − 2)) for delays τ > 0. The function PARITY outputs 1 if the number of inputs that assume the value ū + 1 is odd, and −1 otherwise.

9. This assumption is best justified if such a readout neuron is located, for example, in another brain area that receives massive input from many neurons in this microcircuit and has only diffuse backward projections. But it is certainly problematic and should be addressed in future elaborations of the present approach.

10. One can be even more realistic and filter it also by a model for the short-term dynamics of the synapse into the readout neuron, but this turns out to make no difference for the analysis proposed in this chapter.

11. The membrane voltage V_m was modeled by τ_m dV_m/dt = −(V_m − V_resting) + R_m · (I_syn(t) + I_background + I_noise), where τ_m = 30 ms is the membrane time constant, I_syn models synaptic inputs from other neurons in the circuit, I_background models a constant unspecific background input, and I_noise models noise in the input. The membrane resistance R_m was chosen as 1 MΩ.

12. Short-term synaptic dynamics was modeled according to Markram et al. (1998), with distributions of the synaptic parameters U (initial release probability), D (time constant for depression), and F (time constant for facilitation) chosen to reflect empirical data (see Maass et al., 2002, for details).

13. The VC-dimension of a class H of maps H from some universe S_univ of inputs into {0, 1} is defined as the size of the largest subset S ⊆ S_univ that can be shattered by H. One says that S ⊆ S_univ is shattered by H if for every map f : S → {0, 1} there exists a map H in H such that H(u) = f(u) for all u ∈ S (this means that every possible binary classification of the inputs u ∈ S can be carried out by some hypothesis H in H).

14. The rank of the matrix consisting of the 500 circuit states x_u(t) for t = 200 ms was computed for 500 spike patterns over 200 ms, as described in section 6.9.3; see fig. 6.5a. For each circuit, the average over five different sets of spike patterns was used.

15. Similar results arise if one records the analog values of the circuit states with a limited precision of, say, 1%.

7

A Variational Principle for Graphical Models

Martin J. Wainwright and Michael I. Jordan

Graphical models bring together graph theory and probability theory in a powerful formalism for multivariate statistical modeling. In statistical signal processing— as well as in related ﬁelds such as communication theory, control theory, and bioinformatics—statistical models have long been formulated in terms of graphs, and algorithms for computing basic statistical quantities such as likelihoods and marginal probabilities have often been expressed in terms of recursions operating on these graphs. Examples include hidden Markov models, Markov random ﬁelds, the forward-backward algorithm, and Kalman ﬁltering (Kailath et al., 2000; Pearl, 1988; Rabiner and Juang, 1993). These ideas can be understood, uniﬁed, and generalized within the formalism of graphical models. Indeed, graphical models provide a natural framework for formulating variations on these classical architectures, and for exploring entirely new families of statistical models. The recursive algorithms cited above are all instances of a general recursive algorithm known as the junction tree algorithm (Lauritzen and Spiegelhalter, 1988). The junction tree algorithm takes advantage of factorization properties of the joint probability distribution that are encoded by the pattern of missing edges in a graphical model. For suitably sparse graphs, the junction tree algorithm provides a systematic and practical solution to the general problem of computing likelihoods and other statistical quantities associated with a graphical model. Unfortunately, many graphical models of practical interest are not “suitably sparse,” so that the junction tree algorithm no longer provides a viable computational solution to the problem of computing marginal probabilities and other expectations. 
One popular source of methods for attempting to cope with such cases is the Markov chain Monte Carlo (MCMC) framework, and indeed there is a signiﬁcant literature on the application of MCMC methods to graphical models (Besag and Green, 1993; Gilks et al., 1996). However, MCMC methods can be overly slow for practical applications in ﬁelds such as signal processing, and there has been signiﬁcant interest in developing faster approximation techniques. The class of variational methods provides an alternative approach to computing approximate marginal probabilities and expectations in graphical models. Roughly
speaking, a variational method is based on casting a quantity of interest (e.g., a likelihood) as the solution to an optimization problem, and then solving a perturbed version of this optimization problem. Examples of variational methods for computing approximate marginal probabilities and expectations include the "loopy" form of the belief propagation or sum-product algorithm (McEliece et al., 1998; Yedidia et al., 2001) as well as a variety of so-called mean field algorithms (Jordan et al., 1999; Zhang, 1996).

Our principal goal in this chapter is to give a mathematically precise and computationally oriented meaning to the term variational in the setting of graphical models—a meaning that reposes on basic concepts in the field of convex analysis (Rockafellar, 1970). Compared to the somewhat loose definition of variational that is often encountered in the graphical models literature, our characterization has certain advantages, both in clarifying the relationships among existing algorithms, and in permitting fuller exploitation of the general tools of convex optimization in the design and analysis of new algorithms.

Briefly, the core issues can be summarized as follows. In order to define an optimization problem, it is necessary to specify both a cost function to be optimized, and a constraint set over which the optimization takes place. Reflecting the origins of most existing variational methods in statistical physics, developers of variational methods generally express the function to be optimized as a "free energy," meaning a functional on probability distributions. The set to be optimized over is often left implicit, but it is generally taken to be the set of all probability distributions. A basic exercise in constrained optimization yields the Boltzmann distribution as the general form of the solution. While useful, this derivation has two shortcomings.
First, the optimizing argument is a joint probability distribution, not a set of marginal probabilities or expectations. Thus, the derivation leaves us short of our goal of a variational representation for computing marginal probabilities. Second, the set of all probability distributions is a very large set, and formulating the optimization problem in terms of such a set provides little guidance in the design of computationally efficient approximations.

Our approach addresses both of these issues. The key insight is to formulate the optimization problem not over the set of all probability distributions, but rather over a finite-dimensional set M of realizable mean parameters. This set is convex in general, and it is a polytope in the case of discrete random variables. There are several natural ways to approximate this convex set, and a broad range of extant algorithms turn out to involve particular choices of approximations. In particular, as we will show, the "loopy" form of the sum-product or belief propagation algorithm involves an outer approximation to M, whereas the more classical mean field algorithms involve an inner approximation to the set M.

The characterization of belief propagation as an optimization over an outer approximation of a certain convex set does not arise readily within the standard formulation of variational methods. Indeed, given an optimization over all possible probability distributions, it is difficult to see how to move "outside" of such a set. Similarly, while the standard formulation does provide some insight into the differences between belief propagation and mean field methods (in that
they optimize different "free energies"), the standard formulation does not involve the set M, and hence does not reveal the fundamental difference in terms of outer versus inner approximations.

The core of the chapter is a variational characterization of the problem solved by the junction tree algorithm—that of computing exact marginal probabilities and expectations associated with subsets of nodes in a graphical model. These probabilities are obtained as the maximizing arguments of an optimization over the set M. Perhaps surprisingly, this problem is a convex optimization problem for a broad class of graphical models. With this characterization in hand, we show how variational methods arise as "relaxations"—that is, simplified optimization problems that involve some approximation of the constraint set, the cost function, or both. We show how a variety of standard variational methods, ranging from classical mean field to cluster variational methods, fit within this framework. We also discuss new methods that emerge from this framework, including a relaxation based on semidefinite constraints and a link between reweighted forms of the max-product algorithm and linear programming.

The remainder of the chapter is organized as follows. The first two sections are devoted to basics: section 7.1 provides an overview of graphical models and section 7.2 is devoted to a brief discussion of exponential families. In section 7.3, we develop a general variational representation for computing marginal probabilities and expectations in exponential families. Section 7.4 illustrates how various exact methods can be understood from this perspective. The rest of the chapter—sections 7.5 through 7.7—is devoted to the exploration of various relaxations of this exact variational principle, which in turn yield various algorithms for computing approximations to marginal probabilities and other expectations.

7.1

Background

7.1.1

Graphical Models

A graphical model consists of a collection of probability distributions that factorize according to the structure of an underlying graph. A graph G = (V, E) is formed by a collection of vertices V and a collection of edges E. An edge consists of a pair of vertices, and may either be directed or undirected. Associated with each vertex s ∈ V is a random variable x_s taking values in some set X_s, which may either be continuous (e.g., X_s = R) or discrete (e.g., X_s = {0, 1, . . . , m − 1}). For any subset A of the vertex set V, we define x_A := {x_s | s ∈ A}.

Directed Graphical Models In the directed case, each edge is directed from parent to child. We let π(s) denote the set of all parents of a given node s ∈ V. (If s has no parents, then the set π(s) should be understood to be empty.) With this notation,

158

A Variational Principle for Graphical Models

a directed graphical model consists of a collection of probability distributions that factorize in the following way:

p(x) = ∏_{s∈V} p(x_s | x_{π(s)}).   (7.1)
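As a concrete check of the factorization in equation 7.1, the sketch below evaluates a hypothetical three-node chain x1 → x2 → x3 (the conditional probability tables are made up for illustration) and confirms that the product of per-node conditionals is automatically normalized, with no global constant required:

```python
from itertools import product

# Hypothetical chain x1 -> x2 -> x3 over binary states.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # [x1][x2]
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # [x2][x3]

def joint(x1, x2, x3):
    # Equation 7.1: product of per-node conditionals given the parents
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Summing the factorized joint over all configurations gives exactly 1.
total = sum(joint(*x) for x in product([0, 1], repeat=3))
```

Marginalizing out x2 and x3 recovers p(x1), as the notation-consistency remark below requires.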

It can be verified that our use of notation is consistent, in that p(x_s | x_{π(s)}) is, in fact, the conditional distribution for the global distribution p(x) thus defined.

Undirected Graphical Models In the undirected case, the probability distribution factorizes according to functions defined on the cliques of the graph (i.e., fully connected subsets of V). In particular, associated with each clique C is a compatibility function ψ_C : X^n → R_+ that depends only on the subvector x_C. With this notation, an undirected graphical model (also known as a Markov random field) consists of a collection of distributions that factorize as

p(x) = (1/Z) ∏_C ψ_C(x_C),   (7.2)

where the product is taken over all cliques of the graph. The quantity Z is a constant chosen to ensure that the distribution is normalized. In contrast to the directed case (equation 7.1), in general the compatibility functions ψ_C need not have any obvious or direct relation to local marginal distributions. Families of probability distributions as defined in equation 7.1 or 7.2 also have a characterization in terms of conditional independencies among subsets of random variables. We will not use this characterization in this chapter, but refer the interested reader to Lauritzen (1996) for a full treatment.

7.1.2

Inference Problems and Exact Algorithms

Given a probability distribution p(·) defined by a graphical model, our focus will be on solving one or more of the following inference problems:

1. computing the likelihood;
2. computing the marginal distribution p(x_A) over a particular subset A ⊂ V of nodes;
3. computing the conditional distribution p(x_A | x_B), for disjoint subsets A and B, where A ∪ B is in general a proper subset of V;
4. computing a mode of the density (i.e., an element x̂ in the set arg max_{x∈X^n} p(x)).

Problem 1 is a special case of problem 2, because the likelihood is the marginal probability of the observed data. The computation of a conditional probability in problem 3 is similar in that it also requires marginalization steps, an initial one to obtain the numerator p(x_A, x_B), and a further step to obtain the denominator p(x_B). In contrast, the problem of computing modes stated in problem 4 is fundamentally different, since it entails maximization rather than integration. Although problem 4 is not the main focus of this chapter, there are important connections
between the problem of computing marginals and that of computing modes; these are discussed in section 7.7.2.

To understand the challenges inherent in these inference problems, consider the case of a discrete random vector x ∈ X^n, where X_s = {0, 1, . . . , m − 1} for each vertex s ∈ V. A naive approach to computing a marginal at a single node—say p(x_s)—entails summing over all configurations of the form {x' | x'_s = x_s}. Since this set has m^{n−1} elements, it is clear that a brute-force approach will rapidly become intractable as n grows. Similarly, computing a mode entails solving an integer programming problem over an exponential number of configurations. For continuous random vectors, the problems are no easier1 and typically harder, since they require computing a large number of integrals.

Both directed and undirected graphical models involve factorized expressions for joint probabilities, and it should come as no surprise that exact inference algorithms treat them in an essentially identical manner. Indeed, to permit a simple unified treatment of inference algorithms, it is convenient to convert directed models to undirected models and to work exclusively within the undirected formalism. Any directed graph can be converted, via a process known as moralization (Lauritzen and Spiegelhalter, 1988), to an undirected graph that—at least for the purposes of solving inference problems—is equivalent. Throughout the rest of the chapter, we assume that this transformation has been carried out.

Message-passing on trees For graphs without cycles—also known as trees—these inference problems can be solved exactly by recursive "message-passing" algorithms of a dynamic programming nature, with a computational complexity that scales only linearly in the number of nodes.
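Before turning to message passing, the naive summation described above can be made concrete. The sketch below is a hypothetical toy example (the pairwise potential is made up): it computes a single-node marginal by enumerating all m^n configurations, which is exactly the exponential cost that message passing on trees avoids.

```python
from itertools import product

def brute_force_marginal(n, m, unnorm_prob, s):
    """Marginal p(x_s) by summing the unnormalized joint over all m**n
    configurations (m**(n-1) terms for each value of x_s)."""
    weights = [0.0] * m
    for x in product(range(m), repeat=n):
        weights[x[s]] += unnorm_prob(x)
    z = sum(weights)  # normalization constant Z
    return [w / z for w in weights]

# Hypothetical pairwise chain potential that favors equal neighbors.
def unnorm_prob(x):
    p = 1.0
    for a, b in zip(x, x[1:]):
        p *= 2.0 if a == b else 1.0
    return p

marg = brute_force_marginal(n=4, m=2, unnorm_prob=unnorm_prob, s=0)
```

For n = 4 binary variables this loop already touches 16 configurations; at n = 100 it would be astronomically many, which motivates the tree algorithms that follow.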
In particular, for the case of computing marginals, the dynamic programming solution takes the form of a general algorithm known as the sum-product algorithm, whereas for the problem of computing modes it takes the form of an analogous algorithm known as the max-product algorithm. Here we provide a brief description of these algorithms; further details can be found in various sources (Aji and McEliece, 2000; Kschischang and Frey, 1998; Lauritzen and Spiegelhalter, 1988; Loeliger, 2004). We begin by observing that the cliques of a tree-structured graph T = (V, E(T)) are simply the individual nodes and edges. As a consequence, any tree-structured graphical model has the following factorization:

p(x) = (1/Z) ∏_{s∈V} ψ_s(x_s) ∏_{(s,t)∈E(T)} ψ_{st}(x_s, x_t).   (7.3)

Here we describe how the sum-product algorithm computes the marginal distribution μ_s(x_s) := ∑_{{x' | x'_s = x_s}} p(x') for every node of a tree-structured graph. We will focus in detail on the case of discrete random variables, with the understanding that the computations carry over (at least in principle) to the continuous case by replacing sums with integrals.

Sum-product algorithm The essential principle underlying the sum-product algorithm on trees is divide and conquer: we solve a large problem by breaking it
down into a sequence of simpler problems. The tree itself provides a natural way to break down the problem as follows. For an arbitrary s ∈ V, consider the set of its neighbors N(s) = {u ∈ V | (s, u) ∈ E}. For each u ∈ N(s), let T_u = (V_u, E_u) be the subgraph formed by the set of nodes (and edges joining them) that can be reached from u by paths that do not pass through node s. The key property of a tree is that each such subgraph T_u is again a tree, and T_u and T_v are disjoint for u ≠ v. In this way, each vertex u ∈ N(s) can be viewed as the root of a subtree T_u, as illustrated in fig. 7.1a. For each subtree T_t, we define x_{V_t} := {x_u | u ∈ V_t}. Now consider the collection of terms in equation 7.3 associated with vertices or edges in T_t: collecting all of these terms yields a subproblem p(x_{V_t}; T_t) for this subtree. Now the conditional independence properties of a tree allow the computation of the marginal μ_s at node s to be broken down into a product of the form

μ_s(x_s) ∝ ψ_s(x_s) ∏_{t∈N(s)} M*_{ts}(x_s).   (7.4)

Each term M*_{ts}(x_s) in this product is the result of performing a partial summation for the subproblem p(x_{V_t}; T_t) in the following way:

M*_{ts}(x_s) = ∑_{x'_{T_t}} ψ_{st}(x_s, x'_t) p(x'_{T_t}; T_t).   (7.5)

For fixed x_s, the subproblem defining M*_{ts}(x_s) is again a tree-structured summation, albeit involving a subtree T_t smaller than the original tree T. Therefore, it too can be broken down recursively in a similar fashion. In this way, the marginal at node s can be computed by a series of recursive updates.

Rather than applying the procedure described above to each node separately, the sum-product algorithm computes the marginals for all nodes simultaneously and in parallel. At each iteration, each node t passes a "message" to each of its neighbors u ∈ N(t). This message, which we denote by M_{tu}(x_u), is a function of the possible states x_u ∈ X_u (i.e., a vector of length |X_u| for discrete random variables). On the full graph, there are a total of 2|E| messages, one for each direction of each edge. This full collection of messages is updated, typically in parallel, according to the following recursion:

M_{ts}(x_s) ← κ ∑_{x_t} ψ_{st}(x_s, x_t) ψ_t(x_t) ∏_{u∈N(t)\s} M_{ut}(x_t),   (7.6)

where κ > 0 is a normalization constant. It can be shown (Pearl, 1988) that for tree-structured graphs, iterates generated by the update 7.6 will converge to a unique fixed point M* = {M*_{st}, M*_{ts} | (s, t) ∈ E} after a finite number of iterations. Moreover, component M*_{ts} of this fixed point is precisely equal, up to a normalization constant, to the subproblem defined in equation 7.5, which justifies our abuse of notation post hoc. Since the fixed point M* specifies the solution to all of the subproblems, the marginal μ_s at every node s ∈ V can be computed easily via equation 7.4.
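The update 7.6 and the marginal computation 7.4 can be sketched on a small chain, where exactness on trees can be checked against brute-force enumeration. This is a hypothetical toy example (potentials and sizes are made up), not code from the chapter:

```python
from itertools import product

# Hypothetical chain 0-1-2-3 with binary states: node potentials psi[s][xs]
# and a single symmetric edge potential shared by every edge.
n, m = 4, 2
psi = [[1.0, 2.0], [2.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
coupling = [[2.0, 1.0], [1.0, 2.0]]
edges = [(s, s + 1) for s in range(n - 1)]

def neighbors(t):
    return [u for u in range(n) if (u, t) in edges or (t, u) in edges]

def sum_product(iters=10):
    # messages M[(t, s)][xs], initialized uniformly; parallel update 7.6
    M = {(t, s): [1.0] * m for s in range(n) for t in neighbors(s)}
    for _ in range(iters):
        new = {}
        for (t, s) in M:
            vec = []
            for xs in range(m):
                total = 0.0
                for xt in range(m):
                    prod = coupling[xs][xt] * psi[t][xt]
                    for u in neighbors(t):
                        if u != s:
                            prod *= M[(u, t)][xt]
                    total += prod
                vec.append(total)
            kappa = 1.0 / sum(vec)  # normalization constant kappa
            new[(t, s)] = [v * kappa for v in vec]
        M = new
    # node marginals via equation 7.4
    margs = []
    for s in range(n):
        mu = [psi[s][xs] for xs in range(m)]
        for t in neighbors(s):
            mu = [mu[xs] * M[(t, s)][xs] for xs in range(m)]
        z = sum(mu)
        margs.append([v / z for v in mu])
    return margs

def brute_marginals():
    weights = [[0.0] * m for _ in range(n)]
    for x in product(range(m), repeat=n):
        p = 1.0
        for s in range(n):
            p *= psi[s][x[s]]
        for (s, t) in edges:
            p *= coupling[x[s]][x[t]]
        for s in range(n):
            weights[s][x[s]] += p
    return [[w / sum(ws) for w in ws] for ws in weights]
```

On this tree the message iterates reach their fixed point within a few parallel sweeps, and the resulting marginals agree with exhaustive summation, as the text asserts for tree-structured graphs.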


Max-product algorithm Suppose that the summation in the update 7.6 is replaced by a maximization. The resulting max-product algorithm solves the problem of finding a mode of a tree-structured distribution p(x). In this sense, it represents a generalization of the Viterbi algorithm (Forney, 1973) from chains to arbitrary tree-structured graphs. More specifically, the max-product updates will converge to another unique fixed point M*—distinct, of course, from the sum-product fixed point. This fixed point can be used to compute the max-marginal ν_s(x_s) := max_{{x' | x'_s = x_s}} p(x') at each node of the graph, in an analogous way to the computation of ordinary sum-marginals. Given these max-marginals, it is straightforward to compute a mode x̂ ∈ arg max_x p(x) of the distribution (Dawid, 1992; Wainwright et al., 2004). More generally, updates of this form apply to arbitrary commutative semirings on tree-structured graphs (Aji and McEliece, 2000; Dawid, 1992). The pairs "sum-product" and "max-product" are two particular examples of such an algebraic structure.

Junction Tree Representation We have seen that inference problems on trees can be solved exactly by recursive message-passing algorithms. Given a graph with cycles, a natural idea is to cluster its nodes so as to form a clique tree—that is, an acyclic graph whose nodes are formed by cliques of G. Having done so, it is tempting to simply apply a standard algorithm for inference on trees. However, the clique tree must satisfy an additional restriction so as to ensure consistency of these computations. In particular, since a given vertex s ∈ V may appear in multiple cliques (say C1 and C2), what is required is a mechanism for enforcing consistency among the different appearances of the random variable x_s. In order to enforce consistency, it turns out to be necessary to restrict attention to those clique trees that satisfy a particular graph-theoretic property.
In particular, we say that a clique tree satisfies the running intersection property if for any two clique nodes C1 and C2, all nodes on the unique path joining them contain the intersection C1 ∩ C2. Any clique tree with this property is known as a junction tree. For what type of graphs can one build junction trees? An important result in graph theory asserts that a graph G has a junction tree if and only if it is triangulated.2 This result underlies the junction tree algorithm (Lauritzen and Spiegelhalter, 1988) for exact inference on arbitrary graphs, which consists of the following three steps:

Step 1: Given a graph with cycles G, triangulate it by adding edges as necessary.
Step 2: Form a junction tree associated with the triangulated graph.
Step 3: Run a tree inference algorithm on the junction tree.

We illustrate these basic steps with an example.

Example 7.1 Consider the 3 × 3 grid shown in the top panel of fig. 7.1b. The first step is to form a triangulated version, as shown in the bottom panel of fig. 7.1b. Note that the graph would not be triangulated if the additional edge joining nodes 2 and 8 were not present. Without this edge, the 4-cycle (2 − 4 − 8 − 6 − 2) would lack a chord. Figure 7.1c shows a junction tree associated with this triangulated graph, in which circles represent maximal cliques (i.e., fully connected subsets of nodes that cannot be augmented with an additional node and remain fully connected), and boxes represent separator sets (intersections of cliques adjacent in the junction tree). ♦

Figure 7.1 (a) Decomposition of a tree, rooted at node s, into subtrees. Each neighbor (e.g., u) of node s is the root of a subtree (e.g., T_u). Subtrees T_u and T_v, for u ≠ v, are disconnected when node s is removed from the graph. (b), (c) Illustration of junction tree construction. The top panel in (b) shows the original graph: a 3 × 3 grid. The bottom panel in (b) shows the triangulated version of the original graph. Note the two 4-cliques in the middle. (c) Corresponding junction tree for the triangulated graph in (b), with maximal cliques depicted within ellipses. The rectangles are separator sets; these are intersections of neighboring cliques.

An important by-product of the junction tree construction is an alternative representation of the probability distribution defined by a graphical model. Let C denote the set of all maximal cliques in the triangulated graph, and define S as the set of all separator sets in the junction tree. For each separator set S ∈ S, let d(S) denote the number of maximal cliques to which it is adjacent. The junction tree framework guarantees that the distribution p(·) factorizes in the form

p(x) = ∏_{C∈C} μ_C(x_C) / ∏_{S∈S} [μ_S(x_S)]^{d(S)−1},   (7.7)

where μ_C and μ_S are the marginal distributions over the cliques and separator sets respectively. Observe that unlike the representation of equation 7.2, the decomposition of equation 7.7 is directly in terms of marginal distributions, and does not require a normalization constant (i.e., Z = 1).


Example 7.2 Markov Chain Consider the Markov chain p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2). The cliques in a graphical model representation are {1, 2} and {2, 3}, with separator {2}. Clearly the distribution cannot be written as the product of marginals involving only the cliques. However, if we include the separator, it can be factorized in terms of its marginals—viz., p(x1, x2, x3) = p(x1, x2) p(x2, x3) / p(x2). ♦

To anticipate the development in the sequel, it is helpful to consider the following "inverse" perspective on the junction tree representation. Suppose that we are given a set of functions τ_C(x_C) and τ_S(x_S) associated with the cliques and separator sets in the junction tree. What conditions are necessary to ensure that these functions are valid marginals for some distribution? Suppose that the functions {τ_S, τ_C} are locally consistent in the following sense:

∑_{x_S} τ_S(x_S) = 1   (normalization)   (7.8a)
∑_{{x'_C | x'_S = x_S}} τ_C(x'_C) = τ_S(x_S)   (marginalization)   (7.8b)
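The junction tree factorization of example 7.2 and the local consistency conditions 7.8 can be verified numerically for a small chain. The sketch below uses a made-up joint distribution (a noisy copy chain, purely hypothetical):

```python
from itertools import product

# Hypothetical joint over three binary variables forming a chain 1-2-3:
# p(x1) p(x2|x1) p(x3|x2), where each step copies its parent with prob 0.8.
def joint(x1, x2, x3):
    stay = {True: 0.8, False: 0.2}
    return 0.5 * stay[x1 == x2] * stay[x2 == x3]

states = [0, 1]

def marg(keep):
    """Marginal table over the index set `keep` (tuple of positions 0,1,2)."""
    table = {}
    for x in product(states, repeat=3):
        key = tuple(x[i] for i in keep)
        table[key] = table.get(key, 0.0) + joint(*x)
    return table

# Clique marginals mu_{12}, mu_{23} and separator marginal mu_2.
m12, m23, m2 = marg((0, 1)), marg((1, 2)), marg((1,))

# Equation 7.7 for this chain: p(x) = mu_{12} * mu_{23} / mu_2, and local
# consistency (7.8b): each clique marginal reduces to the separator marginal.
```

Both properties hold exactly here because the joint is Markov with respect to the chain; for a non-Markov joint the clique-over-separator formula would only be an approximation.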

The essence of the junction tree theory described above is that such local consistency is both necessary and sufficient to ensure that these functions are valid marginals for some distribution.

Finally, turning to the computational complexity of the junction tree algorithm, the computational cost grows exponentially in the size of the maximal clique in the junction tree. The size of the maximal clique, minimized over all possible triangulations of a graph, defines an important graph-theoretic quantity known as the treewidth of the graph. Thus, the complexity of the junction tree algorithm is exponential in the treewidth. For certain classes of graphs, including chains and trees, the treewidth is small and the junction tree algorithm provides an effective solution to inference problems. Such families include many well-known graphical model architectures, and the junction tree algorithm subsumes many classical recursive algorithms, including the forward-backward algorithms for hidden Markov models (Rabiner and Juang, 1993), the Kalman filtering-smoothing algorithms for state-space models (Kailath et al., 2000), and the pruning and peeling algorithms from computational genetics (Felsenstein, 1981). On the other hand, there are many graphical models (e.g., grids) for which the treewidth is infeasibly large. Coping with such models requires leaving behind the junction tree framework, and turning to approximate inference algorithms.

7.1.3

Message-Passing Algorithms for Approximate Inference

In the remainder of the chapter, we present a general variational principle for graphical models that can be used to derive a class of techniques known as variational inference algorithms. To motivate our later development, we pause to give a high-level description of two variational inference algorithms, with the goal of highlighting their simple and intuitive nature.

The first variational algorithm that we consider is a so-called "loopy" form of the sum-product algorithm (also referred to as the belief propagation algorithm). Recall that the sum-product algorithm is designed as an exact method for trees; from a purely algorithmic point of view, however, there is nothing to prevent one from running the procedure on a graph with cycles. More specifically, the message updates 7.6 can be applied at a given node while ignoring the presence of cycles—essentially pretending that any given node is embedded in a tree. Intuitively, such an algorithm might be expected to work well if the graph is suitably "tree-like," such that the effect of messages propagating around cycles is appropriately diminished. This algorithm is in fact widely used in various applications that involve signal processing, including image processing, computer vision, computational biology, and error-control coding.

A second variational algorithm is the so-called naive mean field algorithm. For concreteness, we describe it in application to a very special type of graphical model, known as the Ising model. The Ising model is a Markov random field involving a binary random vector x ∈ {0, 1}^n, in which pairs of adjacent nodes are coupled with a weight θst, and each node has an observation weight θs. (See examples 7.4 and 7.11 for a more detailed description of this model.) To motivate the mean field updates, we consider the Gibbs sampler for this model, in which the basic update step is to choose a node s ∈ V randomly, and then to update the state of the associated random variable according to the conditional probability with neighboring states fixed.
More precisely, denoting by N(s) the neighbors of a node s ∈ V, and letting x^{(p)}_{N(s)} denote the state of the neighbors of s at iteration p, the Gibbs update for xs takes the following form:

\[
x_s^{(p+1)} =
\begin{cases}
1 & \text{if } u \le \Bigl\{1 + \exp\Bigl[-\Bigl(\theta_s + \sum_{t \in N(s)} \theta_{st}\, x_t^{(p)}\Bigr)\Bigr]\Bigr\}^{-1}, \\
0 & \text{otherwise,}
\end{cases}
\tag{7.9}
\]

where u is a sample from a uniform distribution U(0, 1). It is well known that this procedure generates a sequence of configurations that converge (in a stochastic sense) to a sample from the Ising model distribution. In a dense graph, such that the cardinality of N(s) is large, we might attempt to invoke a law of large numbers or some other concentration result for \(\sum_{t \in N(s)} \theta_{st} x_t^{(p)}\). To the extent that such sums are concentrated, it might make sense to replace sample values with expectations, which motivates the following averaged version of equation 7.9:

\[
\mu_s \leftarrow \Bigl\{1 + \exp\Bigl[-\Bigl(\theta_s + \sum_{t \in N(s)} \theta_{st}\, \mu_t\Bigr)\Bigr]\Bigr\}^{-1}, \tag{7.10}
\]

in which μs denotes an estimate of the marginal probability p(xs = 1). Thus, rather than flipping the random variable xs with a probability that depends on the state of its neighbors, we update a parameter μs using a deterministic function of the corresponding parameters {μt | t ∈ N(s)} at its neighbors. Equation 7.10 defines the naive mean field algorithm for the Ising model, which can be viewed as a message-passing algorithm on the graph.

At first sight, message-passing algorithms of this nature might seem rather mysterious, and do raise some questions. Do the updates have fixed points? Do the updates converge? What is the relation between the fixed points and the exact quantities? The goal of the remainder of this chapter is to shed some light on such issues. Ultimately, we will see that a broad class of message-passing algorithms, including the mean field updates, the sum-product and max-product algorithms, as well as various extensions of these methods, can all be understood as solving either exact or approximate versions of a certain variational principle for graphical models.
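To make the updates 7.9 and 7.10 concrete, the following sketch runs both procedures on a small Ising model (a hypothetical 4-node cycle with arbitrary weights; pure Python) and compares the marginal estimates they produce:

```python
import math
import random

random.seed(1)

# A hypothetical 4-node Ising model on a cycle: x in {0,1}^4.
n = 4
neighbors = {s: [(s - 1) % n, (s + 1) % n] for s in range(n)}
theta_s = [0.2, -0.1, 0.3, 0.0]   # observation weights (arbitrary)
theta_st = 0.5                     # one shared coupling weight, for simplicity

def cond_prob_one(s, x):
    """p(x_s = 1 | x_{N(s)}), the probability appearing in the Gibbs update (7.9)."""
    field = theta_s[s] + sum(theta_st * x[t] for t in neighbors[s])
    return 1.0 / (1.0 + math.exp(-field))

# Gibbs sampler: estimate p(x_s = 1) by time-averaging the sampled states.
x = [0] * n
counts = [0] * n
sweeps = 20000
for it in range(sweeps):
    s = random.randrange(n)
    x[s] = 1 if random.random() <= cond_prob_one(s, x) else 0
    for t in range(n):
        counts[t] += x[t]
gibbs_marginals = [c / sweeps for c in counts]

# Naive mean field: iterate the deterministic update (7.10) to a fixed point.
mu = [0.5] * n
for it in range(200):
    for s in range(n):
        field = theta_s[s] + sum(theta_st * mu[t] for t in neighbors[s])
        mu[s] = 1.0 / (1.0 + math.exp(-field))

# On this small, weakly coupled model the two estimates agree closely.
max_gap = max(abs(g - m) for g, m in zip(gibbs_marginals, mu))
```

Note that the agreement here is partly an accident of the weak coupling; the questions raised above (fixed points, convergence, accuracy) are exactly what the variational analysis addresses.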

7.2 Graphical Models in Exponential Form

We begin by describing how many graphical models can be viewed as particular types of exponential families. Further background can be found in the books by Efron (1978) and Brown (1986). This exponential family representation is the foundation of our later development of the variational principle.

7.2.1 Maximum Entropy

One way in which to motivate exponential family representations of graphical models is through the principle of maximum entropy. The setup for this principle is as follows: given a collection of functions φα : X^n → R, suppose that we have observed their expected values—that is, we have

\[
\mathbb{E}[\phi_\alpha(x)] = \mu_\alpha \quad \text{for all } \alpha \in I, \tag{7.11}
\]

where μ = {μα | α ∈ I} is a real vector, I is an index set, and d := |I| is the length of the vectors μ and φ := {φα | α ∈ I}. Our goal is to use the observations to infer a full probability distribution. Let P denote the set of all probability distributions p over the random vector x. Since there are (in general) many distributions p ∈ P that are consistent with the observations 7.11, we need a principled method for choosing among them. The principle of maximum entropy is to choose the distribution pME whose entropy, defined as H(p) := −Σ_{x∈X^n} p(x) log p(x), is maximized. More formally, the maximum entropy solution pME is given by the following constrained optimization problem:

\[
p_{ME} := \arg\max_{p \in \mathcal{P}} H(p) \quad \text{subject to constraints 7.11.} \tag{7.12}
\]

One interpretation of this principle is as choosing the distribution with maximal uncertainty while remaining faithful to the data. Presuming that problem 7.12 is feasible, it is straightforward to show using a Lagrangian formulation that its optimal solution takes the form

\[
p(x; \theta) \propto \exp\Bigl\{ \sum_{\alpha \in I} \theta_\alpha\, \phi_\alpha(x) \Bigr\}, \tag{7.13}
\]

which corresponds to a distribution in exponential form. Note that the exponential decomposition 7.13 is analogous to the product decomposition 7.2 considered earlier. In the language of exponential families, the vector θ ∈ R^d is known as the canonical parameter, and the collection of functions φ = {φα | α ∈ I} are known as sufficient statistics. In the context of our current presentation, each canonical parameter θα has a very concrete interpretation as the Lagrange multiplier associated with the constraint E[φα(x)] = μα.

7.2.2 Exponential Families

We now define exponential families in more generality. Any exponential family consists of a particular class of densities taken with respect to a fixed base measure ν. The base measure is typically counting measure (as in our discrete example above), or Lebesgue measure (e.g., for Gaussian families). Throughout this chapter, we use ⟨a, b⟩ to denote the ordinary Euclidean inner product between two vectors a and b of the same dimension. Thus, for each fixed x ∈ X^n, the quantity ⟨θ, φ(x)⟩ is the Euclidean inner product in R^d of the two vectors θ ∈ R^d and φ(x) = {φα(x) | α ∈ I}. With this notation, the exponential family associated with φ consists of the following parameterized collection of density functions:

\[
p(x; \theta) = \exp\bigl\{ \langle \theta, \phi(x) \rangle - A(\theta) \bigr\}. \tag{7.14}
\]

The quantity A, known as the log partition function or cumulant generating function, is defined by the integral

\[
A(\theta) = \log \int_{\mathcal{X}^n} \exp \langle \theta, \phi(x) \rangle \, \nu(dx). \tag{7.15}
\]

Presuming that the integral is finite, this definition ensures that p(x; θ) is properly normalized (i.e., ∫_{X^n} p(x; θ) ν(dx) = 1). With the set of potentials φ fixed, each parameter vector θ indexes a particular member p(x; θ) of the family. The canonical parameters θ of interest belong to the set

\[
\Theta := \{ \theta \in \mathbb{R}^d \mid A(\theta) < \infty \}. \tag{7.16}
\]

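Definitions 7.14 through 7.16 can be made concrete with a small discrete family. The sketch below (a hypothetical pair of binary variables with Ising-style sufficient statistics; numpy assumed) computes A(θ) by enumeration, checks normalization, and evaluates the mean parameters μ = Eθ[φ(x)] that appear in the sequel:

```python
import itertools
import numpy as np

# Hypothetical sufficient statistics for a pair of binary variables (x1, x2):
# phi(x) = (x1, x2, x1*x2), a minimal Ising-style family with d = 3.
def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1 * x2], dtype=float)

theta = np.array([0.5, -0.3, 1.0])   # an arbitrary canonical parameter

configs = list(itertools.product([0, 1], repeat=2))

# Log partition function (7.15), with nu = counting measure on {0,1}^2.
A = np.log(sum(np.exp(theta @ phi(x)) for x in configs))

# Density (7.14) and its normalization check.
p = {x: np.exp(theta @ phi(x) - A) for x in configs}
assert np.isclose(sum(p.values()), 1.0)

# Mean parameters mu = E_theta[phi(x)], previewing equation 7.17.
mu = sum(p[x] * phi(x) for x in configs)
```

Here each component of μ is an expectation of an indicator-like statistic, so all three entries lie strictly between 0 and 1 for any finite θ.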
Throughout this chapter, we deal exclusively with regular exponential families, for which the set Θ is assumed to be open. We summarize for future reference some well-known properties of A:


Lemma 7.1 The cumulant generating function A is convex in terms of θ. Moreover, it is infinitely differentiable on Θ, and its derivatives correspond to cumulants. As an important special case, the first derivatives of A take the form

\[
\frac{\partial A}{\partial \theta_\alpha} = \int_{\mathcal{X}^n} \phi_\alpha(x)\, p(x; \theta)\, \nu(dx) = \mathbb{E}_\theta[\phi_\alpha(x)], \tag{7.17}
\]

and define a vector μ := Eθ[φ(x)] of mean parameters associated with the exponential family.

There are important relations between the canonical and mean parameters, and many inference problems can be formulated in terms of the mean parameters. These correspondences and other properties of the cumulant generating function are fundamental to our development of a variational principle for solving inference problems.

7.2.3 Illustrative Examples

In order to illustrate these definitions, we now discuss some particular classes of graphical models that commonly arise in signal and image processing problems, and how they can be represented in exponential form. In particular, we will see that graphical structure is reflected in the choice of sufficient statistics, or equivalently in terms of constraints on the canonical parameter vector. We begin with an important case—the Gaussian Markov random field (MRF)—which is widely used for modeling various types of imagery and spatial data (Luettgen et al., 1994; Szeliski, 1990).

Example 7.3 Gaussian Markov Random Field
Consider a graph G = (V, E), such as that illustrated in fig. 7.2(a), and suppose that each vertex s ∈ V has an associated Gaussian random variable xs. Any such scalar Gaussian is a (2-dimensional) exponential family specified by sufficient statistics xs and x2s. Turning to the Gaussian random vector x := {xs | s ∈ V}, it has an exponential family representation in terms of the sufficient statistics {xs, x2s | s ∈ V} ∪ {xs xt | (s, t) ∈ E}, with associated canonical parameters {θs, θss | s ∈ V} ∪ {θst | (s, t) ∈ E}. Here the additional cross-terms xs xt allow for possible correlation between components xs and xt of the Gaussian random vector. Note that there are a total of d = 2n + |E| sufficient statistics. The sufficient statistics and parameters can be represented compactly as (n + 1) × (n + 1) symmetric matrices:

\[
X := \begin{bmatrix} 1 \\ x \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix}^{T}, \qquad
U(\theta) := \begin{bmatrix}
0 & \theta_1 & \theta_2 & \cdots & \theta_n \\
\theta_1 & \theta_{11} & \theta_{12} & \cdots & \theta_{1n} \\
\theta_2 & \theta_{21} & \theta_{22} & \cdots & \theta_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\theta_n & \theta_{n1} & \theta_{n2} & \cdots & \theta_{nn}
\end{bmatrix}. \tag{7.18}
\]


Figure 7.2 (a) A simple Gaussian model based on a graph G with 5 vertices. (b) The adjacency matrix of the graph G in (a), which specifies the sparsity pattern of the matrix Z(θ).

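The graph-structured sparsity illustrated in fig. 7.2 can be checked numerically. In the sketch below (numpy assumed; a hypothetical 5-node chain, with a positive definite matrix J standing in for the negated precision matrix, as an assumption of this illustration), zeros in the precision matrix do not translate into zeros of the covariance, whose entries are the second-order mean parameters:

```python
import numpy as np

# Hypothetical precision matrix for a 5-node chain Gaussian MRF:
# J[s, t] = 0 whenever (s, t) is not an edge of the chain.
n = 5
J = np.eye(n) * 2.0
for s in range(n - 1):
    J[s, s + 1] = J[s + 1, s] = -0.8   # couplings on the chain edges only

# A valid model requires J positive definite (equivalently, Z negative definite
# in the sign convention of the text).
assert np.all(np.linalg.eigvalsh(J) > 0)

# The second-order mean parameters are given by the covariance J^{-1}.
Sigma = np.linalg.inv(J)

# Non-adjacent nodes have zero entries in J but nonzero covariance:
assert J[0, 4] == 0.0
assert abs(Sigma[0, 4]) > 1e-6
```

This is the Gaussian instance of a general phenomenon: conditional independence structure is visible in the canonical parameters, not in the mean parameters.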
We use Z(θ) to denote the lower n × n block of U(θ); it is known as the precision matrix. We say that x forms a Gaussian Markov random field if its probability density function decomposes according to the graph G = (V, E). In terms of our canonical parameterization, this condition translates to the requirement that θst = 0 whenever (s, t) ∉ E. Alternatively stated, the precision matrix Z(θ) must have the same zero-pattern as the adjacency matrix of the graph, as illustrated in fig. 7.2b. For any two symmetric matrices C and D, it is convenient to define the inner product ⟨C, D⟩ := trace(C D). Using this notation leads to a particularly compact representation of a Gaussian MRF:

\[
p(x; \theta) = \exp\bigl\{ \langle U(\theta), X \rangle - A(\theta) \bigr\}, \tag{7.19}
\]

where A(θ) := log ∫_{R^n} exp⟨U(θ), X⟩ dx is the log cumulant generating function. The integral defining A(θ) is finite only if the n × n precision matrix Z(θ) is negative definite, so that the domain of A has the form Θ = {θ ∈ R^d | Z(θ) ≺ 0}. Note that the mean parameters in the Gaussian model have a clear interpretation. The singleton elements μs = Eθ[xs] are simply the Gaussian means, whereas the elements μss = Eθ[x2s] and μst = Eθ[xs xt] are second-order moments. ♦

Markov random fields involving discrete random variables also arise in many applications, including image processing, bioinformatics, and error-control coding (Durbin et al., 1998; Geman and Geman, 1984; Kschischang et al., 2001; Loeliger, 2004). As with the Gaussian case, this class of Markov random fields also has a natural exponential representation.

Example 7.4 Multinomial Markov Random Field
Suppose that each xs is a multinomial random variable, taking values in the space Xs = {0, 1, . . . , ms − 1}. In order to represent a Markov random field over the vector x = {xs | s ∈ V} in exponential form, we now introduce a particular set of sufficient statistics that will be useful in what follows. For each j ∈ Xs, let I^j(xs) be an indicator function for the event {xs = j}. Similarly, for each pair (j, k) ∈ Xs × Xt, let I^{jk}(xs, xt) be an indicator for the event {(xs, xt) = (j, k)}. These building blocks yield the following set of sufficient statistics:

\[
\bigl\{ \mathbb{I}^{j}(x_s) \mid s \in V,\; j \in \mathcal{X}_s \bigr\} \,\cup\, \bigl\{ \mathbb{I}^{j}(x_s)\, \mathbb{I}^{k}(x_t) \mid (s, t) \in E,\; (j, k) \in \mathcal{X}_s \times \mathcal{X}_t \bigr\}. \tag{7.20}
\]

The corresponding canonical parameter θ has elements of the form

\[
\theta = \bigl\{ \theta_{s;j} \mid s \in V,\; j \in \mathcal{X}_s \bigr\} \,\cup\, \bigl\{ \theta_{st;jk} \mid (s, t) \in E,\; (j, k) \in \mathcal{X}_s \times \mathcal{X}_t \bigr\}. \tag{7.21}
\]

It is convenient to combine the canonical parameters and indicator functions using the shorthand notation θs(xs) := Σ_{j∈Xs} θ_{s;j} I^j(xs); the quantity θst(xs, xt) can be defined similarly. With this notation, a multinomial MRF with pairwise interactions can be written in exponential form as

\[
p(x; \theta) = \exp\Bigl\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) - A(\theta) \Bigr\}, \tag{7.22}
\]

where the cumulant generating function is given by the summation

\[
A(\theta) := \log \sum_{x \in \mathcal{X}^n} \exp\Bigl\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Bigr\}.
\]

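For models small enough to enumerate, the quantities in example 7.4 can be computed by brute force. The following sketch (a hypothetical three-node chain with random potential tables; numpy assumed) evaluates A(θ) of equation 7.22 and checks that the resulting distribution normalizes:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 3-node chain MRF, each variable taking m = 3 states.
n, m = 3, 3
edges = [(0, 1), (1, 2)]
theta_s = rng.normal(size=(n, m))                       # tables for theta_s(x_s)
theta_st = {e: rng.normal(size=(m, m)) for e in edges}  # tables for theta_st(x_s, x_t)

def score(x):
    """Exponent of equation 7.22, before normalization by A(theta)."""
    val = sum(theta_s[s][x[s]] for s in range(n))
    val += sum(theta_st[(s, t)][x[s]][x[t]] for (s, t) in edges)
    return val

configs = list(itertools.product(range(m), repeat=n))
A = np.log(sum(np.exp(score(x)) for x in configs))   # cumulant generating function
p = {x: np.exp(score(x) - A) for x in configs}
assert np.isclose(sum(p.values()), 1.0)

# Mean parameters are marginal probabilities, e.g. mu_{s;j} = p(x_s = j).
mu_0 = np.array([sum(p[x] for x in configs if x[0] == j) for j in range(m)])
assert np.isclose(mu_0.sum(), 1.0)
```

The exponential cost of this enumeration (m^n terms) is precisely what the junction tree and variational methods are designed to avoid.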
In signal-processing applications of these models, the random vector x is often viewed as hidden or partially observed (for instance, corresponding to the correct segmentation of an image). Thus, it is frequently the case that the functions θs are determined by noisy observations, whereas the terms θst control the coupling between variables xs and xt that are adjacent on the graph (e.g., reflecting spatial continuity assumptions). See fig. 7.3(a) for an illustration of such a multinomial MRF defined on a two-dimensional lattice, which is a widely used model in statistical image processing (Geman and Geman, 1984). In the special case that Xs = {0, 1} for all s ∈ V, the family represented by equation 7.22 is known as the Ising model. Note that the mean parameters associated with this model correspond to particular marginal probabilities. For instance, the mean parameters associated with vertex s have the form μs;j = Eθ[I^j(xs)] = p(xs = j; θ), and the mean parameters μst associated with edge (s, t) have an analogous interpretation as pairwise marginal values. ♦

Example 7.5 Hidden Markov Model
A very important special case of the multinomial MRF is the hidden Markov model (HMM), which is a chain-structured graphical model widely used for the modeling of time series and other one-dimensional signals. It is conventional in the HMM literature to refer to the multinomial random variables x = {xs | s ∈ V} as "state variables." As illustrated in fig. 7.3b, the edge set E defines a chain linking the state variables.

Figure 7.3 (a) A multinomial MRF on a 2D lattice model. (b) A hidden Markov model (HMM) is a special case of a multinomial MRF for a chain-structured graph. (c) The graphical representation of a scalar Gaussian mixture model: the multinomial xs indexes components in the mixture, and ys is conditionally Gaussian (with exponential parameters γs) given the mixture component xs.

The parameters θst(xs, xt) define the state transition matrix; if this transition matrix is the same for all pairs s and t, then we have a homogeneous Markov chain. Associated with each multinomial state variable xs is a noisy observation ys, defined by the conditional probability distribution p(ys | xs). If we condition on the observed value of ys, this conditional probability is simply a function of xs, which we denote by θs(xs). Given these definitions, equation 7.22 describes the conditional probability distribution p(x | y) for the HMM. In fig. 7.3b, this conditioning is captured by shading the corresponding nodes in the graph. Note that the cumulant generating function A(θ) is, in fact, equal to the log likelihood of the observed data. ♦

Graphical models are not limited to cases in which the random variables at each node belong to the same exponential family. More generally, we can consider heterogeneous combinations of exponential family members. A very natural example, which combines the two previous types of graphical model, is that of a Gaussian mixture model. Such mixture models are widely used in modeling various classes of data, including natural images, speech signals, and financial time series data; see the book by Titterington et al. (1986) for further background.

Example 7.6 Mixture Model
As shown in fig. 7.3c, a scalar mixture model has a very simple graphical interpretation. In particular, let xs be a multinomial variable, taking values in Xs = {0, 1, 2, . . . , ms − 1}, specified in exponential parameter form with a function θs(xs). The role of xs is to specify the choice of mixture component in the mixture model, so that our mixture model has ms components in total. We now let ys be conditionally Gaussian given xs, so that the conditional distribution p(ys | xs; γs) can be written in exponential family form with canonical parameters γs that are a function of xs.
Overall, the pair (xs, ys) forms a very simple graphical model in exponential form, as shown in fig. 7.3c.


The pair (xs, ys) serves as a basic building block for more sophisticated graphical models. For example, one model is based on assuming that the mixture vector x is a multinomial MRF defined on an underlying graph G = (V, E), whereas the components of y are conditionally independent given the mixture vector x. These assumptions lead to an exponential family p(y, x; θ, γ) of the form

\[
p(y, x; \theta, \gamma) \propto \prod_{s \in V} p(y_s \mid x_s; \gamma_s)\; \exp\Bigl\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Bigr\}. \tag{7.23}
\]

For tree-structured graphs, Crouse et al. (1998) have applied this type of mixture model to problems in wavelet-based signal processing. ♦

This type of mixture model is a particular example of a broad class of graphical models that involve heterogeneous combinations of exponential family members (e.g., hierarchical Bayesian models).

7.3 An Exact Variational Principle for Inference

With this setup, we can now rephrase inference problems in the language of exponential families. In particular, this chapter focuses primarily on the following two problems: (i) computing the cumulant generating function A(θ); and (ii) computing the vector of mean parameters μ := Eθ[φ(x)]. In Section 7.7.2 we discuss a closely related problem—namely, that of computing a mode of the distribution p(x; θ).

The problem of computing the cumulant generating function arises in a variety of signal-processing problems, including likelihood ratio tests (for classification and detection problems) and parameter estimation. The computation of mean parameters is also fundamental, and takes different forms depending on the underlying graphical model. For instance, it corresponds to computing means and covariances in the Gaussian case, whereas for a multinomial MRF it corresponds to computing marginal distributions.

The goal of this section is to show how both of these inference problems can be represented variationally—as the solution of an optimization problem. The variational principle that we develop, though related to the classical "free energy" approach of statistical physics (Yedidia et al., 2001), also has important differences. The classical principle yields a variational formulation for the cumulant generating function (or log partition function) in terms of optimizing over the space of all distributions. In our approach, on the other hand, the optimization is not defined over all distributions—a very high or infinite-dimensional space—but rather over the much lower dimensional space of mean parameters. As an important consequence, solving this variational principle yields not only the cumulant generating function but also the full set of mean parameters μ = {μα | α ∈ I}.

7.3.1 Conjugate Duality

The cornerstone of our variational principle is the notion of conjugate duality. In this section, we provide a brief introduction to this concept, and refer the interested reader to the standard texts (Hiriart-Urruty and Lemaréchal, 1993; Rockafellar, 1970) for further details. As is standard in convex analysis, we consider extended real-valued functions, meaning that they take values in the extended real line R* := R ∪ {+∞}. Associated with any convex function f : R^d → R* is a conjugate dual function f* : R^d → R*, which is defined as follows:

\[
f^*(y) := \sup_{x \in \mathbb{R}^d} \bigl\{ \langle y, x \rangle - f(x) \bigr\}. \tag{7.24}
\]

This definition illustrates the concept of a variational definition: the function value f*(y) is specified as the solution of an optimization problem parameterized by the vector y ∈ R^d. As illustrated in fig. 7.4, the value f*(y) has a natural geometric interpretation as the (negative) intercept of the hyperplane with normal (y, −1) that supports the epigraph of f. In particular, consider the family of hyperplanes of the form ⟨y, x⟩ − c, where y is a fixed normal direction and c ∈ R is the intercept to be adjusted. Our goal is to find the smallest c such that the resulting hyperplane supports the epigraph of f.

Figure 7.4 Interpretation of conjugate duality in terms of supporting hyperplanes to the epigraph of f, defined as epi(f) := {(x, y) ∈ R^d × R | f(x) ≤ y}. The dual function is obtained by translating the family of hyperplanes with normal y and intercept −c until it just supports the epigraph of f (the shaded region).

Note that the hyperplane ⟨y, x⟩ − c lies below the epigraph of f if and only if the inequality ⟨y, x⟩ − c ≤ f(x) holds for all x ∈ R^d. Moreover, it can be seen that the smallest c for which this inequality is valid is given by c* = sup_{x∈R^d} {⟨y, x⟩ − f(x)}, which is precisely the value of the dual function. As illustrated in fig. 7.4, the geometric interpretation is that of moving the hyperplane (by adjusting the intercept c) until it is just tangent to the epigraph of f.

For convex functions meeting certain technical conditions, taking the dual twice recovers the original function. In analytical terms, this fact means that we can generate a variational representation for convex f in terms of its dual function as follows:

\[
f(x) = \sup_{y \in \mathbb{R}^d} \bigl\{ \langle x, y \rangle - f^*(y) \bigr\}. \tag{7.25}
\]

Our goal in the next few sections is to apply conjugacy to the cumulant generating function A associated with an exponential family, as defined in equation 7.15. More specifically, its dual function takes the form

\[
A^*(\mu) := \sup_{\theta \in \Theta} \bigl\{ \langle \theta, \mu \rangle - A(\theta) \bigr\}, \tag{7.26}
\]

where we have used the fact that, by definition, the function value A(θ) is finite only if θ ∈ Θ. Here μ ∈ R^d is a vector of so-called dual variables of the same dimension as θ. Our choice of notation—using μ for the dual variables—is deliberately suggestive: as we will see momentarily, these dual variables turn out to be precisely the mean parameters defined in equation 7.17.

Example 7.7
To illustrate the computation of a dual function, consider a scalar Bernoulli random variable x ∈ {0, 1}, whose distribution can be written in the exponential family form as p(x; θ) = exp{θx − A(θ)}. The cumulant generating function is given by A(θ) = log[1 + exp(θ)], and there is a single dual variable μ = Eθ[x]. Thus, the variational problem 7.26 defining A* takes the form

\[
A^*(\mu) = \sup_{\theta \in \mathbb{R}} \bigl\{ \theta\mu - \log[1 + \exp(\theta)] \bigr\}. \tag{7.27}
\]

If μ ∈ (0, 1), then taking derivatives shows that the supremum is attained at the unique θ ∈ R satisfying the well-known logistic relation θ = log[μ/(1 − μ)]. Substituting this logistic relation into equation 7.27 yields that for μ ∈ (0, 1), we have A*(μ) = μ log μ + (1 − μ) log(1 − μ). By taking limits μ → 1− and μ → 0+, it can be seen that this expression is valid for μ in the closed interval [0, 1]. Figure 7.5 illustrates the behavior of the supremum in equation 7.27 for μ ∉ [0, 1]. From our geometric interpretation of the value A*(μ) in terms of supporting hyperplanes, the dual value is +∞ if no supporting hyperplane can be found. In this particular case, the log partition function A(θ) = log[1 + exp(θ)] is strictly positive, and hence bounded below by the horizontal line of slope zero. Therefore, as illustrated in fig. 7.5a, any slope μ < 0 cannot support epi A, which implies that A*(μ) = +∞. A similar picture holds for the case μ > 1, as shown in fig. 7.5b. Consequently, the dual function is equal to +∞ for μ ∉ [0, 1]. ♦

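The closed-form dual derived in example 7.7 is easy to confirm numerically. The following sketch (pure Python; a simple grid search stands in for the exact supremum, an assumption of this illustration) compares a brute-force evaluation of 7.27 against the negative binary entropy:

```python
import math

def A(theta):
    """Bernoulli cumulant generating function A(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def A_star_numeric(mu, lo=-30.0, hi=30.0, steps=60001):
    """Approximate sup_theta { theta*mu - A(theta) } by grid search (7.27)."""
    best = -math.inf
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        best = max(best, theta * mu - A(theta))
    return best

def A_star_closed(mu):
    """Closed form for mu in (0, 1): mu*log(mu) + (1-mu)*log(1-mu)."""
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

# Inside (0, 1) the grid search matches the negative binary entropy.
for mu in (0.1, 0.3, 0.5, 0.9):
    assert abs(A_star_numeric(mu) - A_star_closed(mu)) < 1e-4

# Outside [0, 1] the supremum is unbounded: on a finite grid the value just
# grows with the grid range, consistent with A*(mu) = +infinity there.
assert A_star_numeric(-0.5) > A_star_numeric(-0.5, lo=-10.0, hi=10.0)
```

The last check is the numerical shadow of the supporting-hyperplane picture in fig. 7.5: for μ outside [0, 1] no hyperplane supports epi A.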


Figure 7.5 Behavior of the supremum deﬁning A∗ (μ) for (a) μ < 0 and (b) μ > 1. The value of the dual function corresponds to the negative intercept of the supporting hyperplane to epi A with slope μ.

As the preceding example illustrates, there are two aspects to characterizing the dual function A*: determining its domain (i.e., the set on which it takes a finite value), and specifying its precise functional form on the domain. In example 7.7, the domain of A* is simply the closed interval [0, 1], and its functional form on its domain is that of the negative binary entropy function. In the following two sections, we consider each of these aspects in more detail for general graphical models in exponential form.

7.3.2 Sets of Realizable Mean Parameters

For a given μ ∈ R^d, consider the optimization problem on the right-hand side of equation 7.26: since the cost function is differentiable, a first step in the solution is to take the derivative with respect to θ and set it equal to zero. Doing so yields the zero-gradient condition

\[
\mu = \nabla A(\theta) = \mathbb{E}_\theta[\phi(x)], \tag{7.28}
\]

where the second equality follows from the standard properties of A given in lemma 7.1. We now need to determine the set of μ ∈ R^d for which equation 7.28 has a solution. Observe that any μ ∈ R^d satisfying this equation has a natural interpretation as a globally realizable mean parameter—i.e., a vector that can be realized by taking expectations of the sufficient statistic vector φ. This observation motivates defining the following set:

\[
\mathcal{M} := \Bigl\{ \mu \in \mathbb{R}^d \Bigm| \exists\, p(\cdot) \text{ such that } \int \phi(x)\, p(x)\, \nu(dx) = \mu \Bigr\}, \tag{7.29}
\]

which corresponds to all realizable mean parameters associated with the set of sufficient statistics φ.


Example 7.8 Gaussian Mean Parameters
The Gaussian MRF, first introduced in example 7.3, provides a simple illustration of the set M. Given the sufficient statistics that define a Gaussian, the associated mean parameters are either first-order moments (e.g., μs = E[xs]) or second-order moments (e.g., μss = E[x2s] and μst = E[xs xt]). This full collection of mean parameters can be compactly represented in matrix form:

\[
W(\mu) := \mathbb{E}_\theta\!\left[ \begin{bmatrix} 1 \\ x \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix}^{T} \right]
= \begin{bmatrix}
1 & \mu_1 & \mu_2 & \cdots & \mu_n \\
\mu_1 & \mu_{11} & \mu_{12} & \cdots & \mu_{1n} \\
\mu_2 & \mu_{21} & \mu_{22} & \cdots & \mu_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mu_n & \mu_{n1} & \mu_{n2} & \cdots & \mu_{nn}
\end{bmatrix}. \tag{7.30}
\]

The Schur complement lemma (Horn and Johnson, 1985) implies that det W(μ) = det cov(x), so that a mean parameter vector μ = {μs | s ∈ V} ∪ {μst | (s, t) ∈ E} is globally realizable if and only if the matrix W(μ) is strictly positive definite. Thus, the set M is straightforward to characterize in the Gaussian case. ♦

Example 7.9 Marginal Polytopes
We now consider the case of a multinomial MRF, first introduced in example 7.4. With the choice of sufficient statistics (eq. 7.20), the associated mean parameters are simply local marginal probabilities, viz.,

\[
\mu_{s;j} := p(x_s = j; \theta) \;\;\forall\, s \in V, \qquad
\mu_{st;jk} := p((x_s, x_t) = (j, k); \theta) \;\;\forall\, (s, t) \in E. \tag{7.31}
\]

In analogy to our earlier definition of θs(xs), we define functional versions of the mean parameters as follows:

\[
\mu_s(x_s) := \sum_{j \in \mathcal{X}_s} \mu_{s;j}\, \mathbb{I}^{j}(x_s), \qquad
\mu_{st}(x_s, x_t) := \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \mu_{st;jk}\, \mathbb{I}^{jk}(x_s, x_t). \tag{7.32}
\]

With this notation, the set M consists of all singleton marginals μs (as s ranges over V) and pairwise marginals μst (for edges (s, t) in the edge set E) that can be realized by a distribution with support on X^n. Since the space X^n has a finite number of elements, the set M is formed by taking the convex hull of a finite number of vectors. As a consequence, it must be a polytope, meaning that it can be described by a finite number of linear inequality constraints. In this discrete case, we refer to M as a marginal polytope, denoted by MARG(G); see fig. 7.6 for an idealized illustration. As discussed in section 7.4.2, it is straightforward to specify a set of necessary conditions, expressed in terms of local constraints, that any element of MARG(G) must satisfy. However—and in sharp contrast to the Gaussian case—characterizing the marginal polytope exactly for a general graph is intractable, as it requires, in general, an exponential number of linear inequality constraints. Indeed, if it were possible to characterize MARG(G) with a polynomial-sized set of constraints, then this would imply the polynomial-time solvability of various NP-complete problems (see section 7.7.2 for further discussion of this point). ♦

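Both characterizations of M can be exercised numerically. The sketch below (hypothetical parameters, numpy assumed) checks the Gaussian condition of example 7.8, namely that W(μ) built from a valid mean and covariance is positive definite with det W(μ) = det cov(x), and illustrates the discrete case of example 7.9, where mean parameters arise as convex combinations of vertex vectors φ(e):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Gaussian case (example 7.8): build W(mu) from a hypothetical mean/covariance.
n = 3
mean = rng.normal(size=n)
Lfac = rng.normal(size=(n, n))
cov = Lfac @ Lfac.T + np.eye(n)              # a valid (positive definite) covariance
second = cov + np.outer(mean, mean)          # second moments E[x x^T]
W = np.block([[np.ones((1, 1)), mean[None, :]],
              [mean[:, None], second]])
assert np.all(np.linalg.eigvalsh(W) > 0)     # realizable: W(mu) is positive definite
assert np.isclose(np.linalg.det(W), np.linalg.det(cov))  # Schur complement identity

# Discrete case (example 7.9): mean parameters of two binary variables are
# convex combinations of the vertex vectors phi(e), e in {0,1}^2.
def phi(e):
    return np.array([e[0], e[1], e[0] * e[1]], dtype=float)

configs = list(itertools.product([0, 1], repeat=2))
p = rng.dirichlet(np.ones(len(configs)))     # an arbitrary distribution on X^n
mu = sum(pi * phi(e) for pi, e in zip(p, configs))
# mu lies in the marginal polytope, and obeys simple local (facet-type)
# constraints such as mu_12 <= min(mu_1, mu_2).
assert mu[2] <= min(mu[0], mu[1]) + 1e-12
```

For this tiny graph the local constraints happen to cut out the polytope exactly; the intractability discussed above concerns general graphs, where such local conditions are necessary but far from sufficient.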

Figure 7.6 Geometrical illustration of a marginal polytope. Each vertex corresponds to the mean parameter μe := φ(e) realized by the distribution δe(x) that puts all of its mass on the configuration e ∈ X^n. The faces of the marginal polytope are specified by hyperplane constraints ⟨aj, μ⟩ ≤ bj.

7.3.3 Entropy in Terms of Mean Parameters

We now turn to the second aspect of the characterization of the conjugate dual function A∗ —that of specifying its precise functional form on its domain M. As might be expected from our discussion of maximum entropy in section 7.2.1, the form of the dual function A∗ turns out to be closely related to entropy. Accordingly, we begin by deﬁning the entropy in a bit more generality: Given a density function p taken with respect to base measure ν, its entropy is given by

\[
H(p) = -\int_{\mathcal{X}^n} p(x) \log[p(x)]\, \nu(dx) = -\mathbb{E}_p[\log p(x)]. \tag{7.33}
\]

With this setup, now suppose that μ belongs to the interior of M. Under this assumption, it can be shown (Brown, 1986; Wainwright and Jordan, 2003b) that there exists a canonical parameter θ(μ) ∈ Θ such that

\[
\mathbb{E}_{\theta(\mu)}[\phi(x)] = \mu. \tag{7.34}
\]

Substituting this relation into the definition of the dual function (eq. 7.26) yields

\[
A^*(\mu) = \langle \mu, \theta(\mu) \rangle - A(\theta(\mu)) = \mathbb{E}_{\theta(\mu)}\bigl[ \log p(x; \theta(\mu)) \bigr],
\]

which we recognize as the negative entropy −H(p(x; θ(μ))), where μ and θ(μ) are dually coupled via equation 7.34. Summarizing our development thus far, we have established that the dual function A* has the following form:

\[
A^*(\mu) =
\begin{cases}
-H(p(x; \theta(\mu))) & \text{if } \mu \text{ belongs to the interior of } \mathcal{M}, \\
+\infty & \text{if } \mu \text{ is outside the closure of } \mathcal{M}.
\end{cases}
\tag{7.35}
\]

An alternative way to interpret this dual function A* is by returning to the maximum entropy problem originally considered in section 7.2.1. More specifically, suppose that we consider the optimal value of the maximum entropy problem given in 7.12, considered parametrically as a function of the constraints μ. Essentially, what we have established is that the parametric form of this optimal value function is the negative of the dual function—that is,

\[
-A^*(\mu) = \max_{p \in \mathcal{P}} H(p) \quad \text{subject to } \mathbb{E}_p[\phi_\alpha(x)] = \mu_\alpha \text{ for all } \alpha \in I. \tag{7.36}
\]

In this context, the property that A∗ (μ) = +∞ for a constraint vector μ outside of M has a concrete interpretation: it corresponds to infeasibility of the maximum entropy problem (eq. 7.12).

Exact variational principle

Given the form of the dual function (eq. 7.35), we can now use the conjugate dual relation 7.25 to express A in terms of an optimization problem involving its dual function and the mean parameters:

A(θ) = sup_{μ∈M} { ⟨θ, μ⟩ − A∗(μ) }.   (7.37)

Note that the optimization is restricted to the set M of globally realizable mean parameters, since the dual function A∗ is infinite outside this set. Thus, we have expressed the cumulant generating function as the solution of an optimization problem that is convex (since it entails maximizing a concave function over the convex set M) and low dimensional (since it is expressed in terms of the mean parameters μ ∈ R^d). In addition to representing the value A(θ) of the cumulant generating function, the variational principle 7.37 also has another important property. More specifically, the nature of our dual construction ensures that the optimum is always attained at the vector of mean parameters μ = Eθ[φ(x)]. Consequently, solving this optimization problem yields both the value of the cumulant generating function and the full set of mean parameters. In this way, the variational principle 7.37 based on exponential families differs fundamentally from the classical free energy principle from statistical physics.
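As a concrete sanity check of the principle 7.37, consider the Bernoulli family p(x; θ) ∝ exp(θx) on x ∈ {0, 1}, for which A(θ) = log(1 + e^θ) and A∗(μ) is the negative Bernoulli entropy. The sketch below (plain Python; illustrative, not from the chapter) maximizes ⟨θ, μ⟩ − A∗(μ) over a grid and confirms that the optimal value recovers A(θ), with the optimum attained at the mean parameter σ(θ):

```python
import math

def A(theta):       # cumulant generating function: log(1 + e^theta)
    return math.log(1.0 + math.exp(theta))

def A_star(mu):     # dual function: negative Bernoulli entropy
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

theta = 1.3
grid = [i / 10000.0 for i in range(1, 10000)]      # mu in (0, 1)
values = [theta * mu - A_star(mu) for mu in grid]
best = max(range(len(grid)), key=values.__getitem__)

mu_star = grid[best]
mean_param = 1.0 / (1.0 + math.exp(-theta))        # E_theta[x]

assert abs(values[best] - A(theta)) < 1e-4         # optimal value is A(theta)
assert abs(mu_star - mean_param) < 1e-3            # attained at the mean parameter
```

The same two facts (the value equals A(θ), and the argmax equals Eθ[φ(x)]) are exactly what the general principle asserts for arbitrary exponential families.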

7.4 Exact Inference in Variational Form

In order to illustrate the general variational principle given in equation 7.37, it is worthwhile considering important cases in which it can be solved exactly. Accordingly, this section treats in some detail the case of a Gaussian MRF on an arbitrary graph — for which we rederive the normal equations — as well as the case of a multinomial MRF on a tree, for which we sketch out a derivation of the sum-product algorithm from a variational perspective. In addition to providing a novel perspective on exact methods, the variational principle 7.37 also underlies a variety of methods for approximate inference, as we will see in section 7.5.

178

A Variational Principle for Graphical Models

7.4.1

Exact Inference in Gaussian Markov Random Fields

We begin by considering the case of a Gaussian Markov random field (MRF) on an arbitrary graph, as discussed in examples 7.3 and 7.8. In particular, we showed in the latter example that the set MGauss of realizable Gaussian mean parameters μ is determined by a positive definiteness constraint on the matrix W(μ) of mean parameters defined in equation 7.30. We now consider the form of the dual function A∗(μ). It is well known (Cover and Thomas, 1991) that the entropy of a multivariate Gaussian random vector can be written as

H(p) = (1/2) log det cov(x) + (n/2) log 2πe,

where cov(x) is the n × n covariance matrix of x. By recalling the definition (eq. 7.30) of W(μ) and applying the Schur complement formula (Horn and Johnson, 1985), we see that det cov(x) = det W(μ), which implies that the dual function for a Gaussian can be written in the form

A∗Gauss(μ) = −(1/2) log det W(μ) − (n/2) log 2πe,   (7.38)

valid for all μ ∈ MGauss. (To understand the negative signs, recall from equation 7.35 that A∗ is equal to the negative entropy for μ ∈ MGauss.) Combining this exact expression for A∗Gauss with our characterization of MGauss leads to

AGauss(θ) = sup_{W(μ) ≻ 0, W₁₁(μ) = 1} { ⟨U(θ), W(μ)⟩ + (1/2) log det W(μ) + (n/2) log 2πe },   (7.39)

which corresponds to the variational principle 7.37 specialized to the Gaussian case. We now show how solving the optimization problem 7.39 leads to the normal equations for Gaussian inference. In order to do so, it is convenient to introduce the following notation for the different blocks of the matrices W(μ) and U(θ):

W(μ) = [ 1      zᵀ(μ) ]        U(θ) = [ 0      zᵀ(θ) ]
       [ z(μ)   Z(μ)  ],               [ z(θ)   Z(θ)  ].   (7.40)

In this definition, the submatrices Z(μ) and Z(θ) are n × n, whereas z(μ) and z(θ) are n × 1 vectors. Now if W(μ) ≻ 0 were the only constraint in problem 7.39, then, using the fact that ∇ log det W = W⁻¹ for any symmetric positive definite matrix W, the optimal solution to problem 7.39 would simply be W(μ) = −2[U(θ)]⁻¹. Accordingly, if we enforce the constraint [W(μ)]₁₁ = 1 using a Lagrange multiplier λ, then it follows from the Karush-Kuhn-Tucker conditions (Bertsekas, 1995b) that the optimal solution assumes the form W(μ) = −2[U(θ) + λ∗E₁₁]⁻¹, where λ∗ is the optimal setting of the Lagrange multiplier and E₁₁ is an (n + 1) × (n + 1) matrix with a one in the upper left-hand corner and zeros in all other entries. Using the standard


formula for the inverse of a block-partitioned matrix (Horn and Johnson, 1985), it is straightforward to verify that the blocks in the optimal W(μ) are related to the blocks of U(θ) by the relations

Z(μ) − z(μ)zᵀ(μ) = −2[Z(θ)]⁻¹,   (7.41a)
z(μ) = −[Z(θ)]⁻¹ z(θ).   (7.41b)
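Equations 7.41a and 7.41b are easy to verify numerically. In the sketch below (assuming numpy; the setup is illustrative), λ∗ is computed in closed form from the Schur complement so that [W(μ)]₁₁ = 1, and both block relations are then checked:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Canonical parameters: Z_theta negative definite, z_theta arbitrary.
A_mat = rng.standard_normal((n, n))
Z_theta = -(A_mat @ A_mat.T + np.eye(n))
z_theta = rng.standard_normal(n)

# Assemble U(theta) as in eq. 7.40 (zero in the upper-left corner).
U = np.zeros((n + 1, n + 1))
U[0, 1:] = z_theta
U[1:, 0] = z_theta
U[1:, 1:] = Z_theta

# Choose lambda* so that the (1,1) entry of -2[U + lam*E11]^{-1} equals 1:
# the Schur complement lam - z^T Z^{-1} z must equal -2.
lam = z_theta @ np.linalg.solve(Z_theta, z_theta) - 2.0
E11 = np.zeros((n + 1, n + 1))
E11[0, 0] = 1.0
W = -2.0 * np.linalg.inv(U + lam * E11)
assert abs(W[0, 0] - 1.0) < 1e-8

z_mu, Z_mu = W[1:, 0], W[1:, 1:]
# Eq. 7.41a: covariance block equals -2 [Z(theta)]^{-1}.
assert np.allclose(Z_mu - np.outer(z_mu, z_mu), -2.0 * np.linalg.inv(Z_theta))
# Eq. 7.41b: mean block equals -[Z(theta)]^{-1} z(theta).
assert np.allclose(z_mu, -np.linalg.solve(Z_theta, z_theta))
```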

(The multiplier λ∗ turns out not to be involved in these particular blocks.) In order to interpret these relations, it is helpful to return to the definition of U(θ) given in equation 7.18, and the Gaussian density of equation 7.19. In this way, we see that the first part of equation 7.41 corresponds to the fact that the covariance matrix is the inverse of the precision matrix, whereas the second part corresponds to the normal equations for the mean z(μ) of a Gaussian. Thus, as a special case of the general variational principle 7.37, we have rederived the familiar equations for Gaussian inference. It is worthwhile noting that the derivation did not exploit any particular features of the graph structure. The Gaussian case is remarkable in this regard, in that both the dual function A∗ and the set M of realizable mean parameters can be characterized simply for an arbitrary graph. However, many methods for solving the normal equations 7.41 as efficiently as possible, including Kalman filtering on trees (Willsky, 2002), make heavy use of the underlying graphical structure.

7.4.2 Exact Inference on Trees

We now turn to the case of tree-structured Markov random fields, focusing for concreteness on the multinomial case, first introduced in example 7.4 and treated in more depth in example 7.9. Recall from the latter example that for a multinomial MRF, the set M of realizable mean parameters corresponds to a marginal polytope, which we denote by MARG(G). There is an obvious set of local constraints that any member of MARG(G) must satisfy. For instance, given their interpretation as local marginal distributions, the vectors μs and μst must of course be nonnegative. In addition, they must satisfy the normalization conditions (i.e., Σ_{xs} μs(xs) = 1) and the pairwise marginalization conditions (i.e., Σ_{xt} μst(xs, xt) = μs(xs)). Accordingly, we define for any graph G the following constraint set:

LOCAL(G) := { μ ≥ 0 | Σ_{xs} μs(xs) = 1, Σ_{xt} μst(xs, xt) = μs(xs) ∀ (s, t) ∈ E }.   (7.42)

Since any set of singleton and pairwise marginals (regardless of the underlying graph structure) must satisfy these local consistency constraints, we are guaranteed that MARG(G) ⊆ LOCAL(G) for any graph G. This fact plays a significant role in our later discussion in section 7.6 of the Bethe variational principle and sum-product on graphs with cycles. Of most importance to the current development is the following


consequence of the junction tree theorem (see section 7.1.2, subsection on junction-tree representation): when the graph G is tree-structured, then LOCAL(T) = MARG(T). Thus, the marginal polytope MARG(T) for trees has the very simple description given in equation 7.42. The second component of the exact variational principle 7.37 is the dual function A∗. Here the junction tree framework is useful again: in particular, specializing the representation in equation 7.7 to a tree yields the factorization

p(x; μ) = ∏_{s∈V} μs(xs) ∏_{(s,t)∈E} [μst(xs, xt) / (μs(xs) μt(xt))]   (7.43)

of a tree-structured distribution in terms of its mean parameters μs and μst. From this decomposition, it is straightforward to compute the entropy purely as a function of the mean parameters, by taking the logarithm and expectations and simplifying. Doing so yields the expression

−A∗(μ) = Σ_{s∈V} Hs(μs) − Σ_{(s,t)∈E} Ist(μst),   (7.44)

where the singleton entropy Hs and mutual information Ist are given by

Hs(μs) := −Σ_{xs} μs(xs) log μs(xs),
Ist(μst) := Σ_{xs,xt} μst(xs, xt) log [μst(xs, xt) / (μs(xs) μt(xt))],

respectively. Putting the pieces together, the general variational principle 7.37 takes the following particular form:

A(θ) = max_{μ∈LOCAL(T)} { ⟨θ, μ⟩ + Σ_{s∈V} Hs(μs) − Σ_{(s,t)∈E} Ist(μst) }.   (7.45)
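On a tree, the decomposition 7.44 is exact, and this can be checked by enumeration. The sketch below (plain Python; the chain and its conditionals are illustrative) builds a three-node chain distribution, computes its entropy directly, and compares it with Σ Hs − Σ Ist evaluated on the singleton and pairwise marginals:

```python
import math
from itertools import product

# A chain-structured (tree) distribution over three binary variables:
# p(x1, x2, x3) = p1(x1) * p21(x2 | x1) * p32(x3 | x2)
p1  = {0: 0.6, 1: 0.4}
p21 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p21[x1][x2]
p32 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p32[x2][x3]

joint = {(a, b, c): p1[a] * p21[a][b] * p32[b][c]
         for a, b, c in product((0, 1), repeat=3)}

# Exact entropy by enumeration.
H_exact = -sum(p * math.log(p) for p in joint.values())

# Singleton and pairwise marginals (mean parameters mu_s and mu_st).
def marg(keep):
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

mu = {s: marg((s,)) for s in range(3)}
mu_e = {e: marg(e) for e in [(0, 1), (1, 2)]}       # the tree edges

H_s = {s: -sum(p * math.log(p) for p in mu[s].values()) for s in range(3)}
I_e = {e: sum(p * math.log(p / (mu[e[0]][(x[0],)] * mu[e[1]][(x[1],)]))
              for x, p in mu_e[e].items()) for e in mu_e}

H_tree = sum(H_s.values()) - sum(I_e.values())      # eq. 7.44 on a tree
assert abs(H_exact - H_tree) < 1e-10
```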

There is an important link between this variational principle for multinomial MRFs on trees, and the sum-product updates (eq. 7.6). In particular, the sum-product updates can be derived as an iterative algorithm for solving a Lagrangian dual formulation of the problem 7.45. This will be clariﬁed in our discussion of the Bethe variational principle in section 7.6.

7.5 Approximate Inference in Variational Form

Thus far, we have seen how well-known methods for exact inference — specifically, the computation of means and covariances in the Gaussian case and the computation of local marginal distributions by the sum-product algorithm for tree-structured problems — can be rederived from the general variational principle (eq. 7.37). It is worthwhile isolating the properties that permit an exact solution of the variational principle. First, for both of the preceding cases, it is possible to characterize the set M of globally realizable mean parameters in a straightforward manner. Second, the entropy can be expressed as a closed-form function of the mean parameters μ, so that the dual function A∗(μ) has an explicit form. Neither of these two properties holds for a general graphical model in exponential form. As a consequence, there are significant challenges associated with exploiting the variational representation. More precisely, in contrast to the simple cases discussed thus far, many graphical models of interest have the following properties:

1. The constraint set M of realizable mean parameters is extremely difficult to characterize in an explicit manner.
2. The negative entropy function A∗ is defined indirectly — in a variational manner — so that it too typically lacks an explicit form.

These difficulties motivate the use of approximations to M and A∗. Indeed, a broad class of methods for approximate inference — ranging from mean field theory to cluster variational methods — is based on this strategy. Accordingly, the remainder of the chapter is devoted to a discussion of approximate methods based on relaxations of the exact variational principle.

7.5.1 Mean Field Theory

We begin our discussion of approximate algorithms with mean field methods, a set of algorithms with roots in statistical physics (Chandler, 1987). Working from the variational principle 7.37, we show that mean field methods can be understood as solving an approximation thereof, with the essential restriction that the optimization is limited to a subset of distributions for which the dual function A∗ is relatively easy to characterize. Throughout this section, we will refer to a distribution with this property as a tractable distribution.

Tractable Families

Let H represent a subgraph of G over which it is feasible to perform exact calculations (e.g., a graph with small treewidth); we refer to any such H as a tractable subgraph. In an exponential formulation, the set of all distributions that respect the structure of H can be represented by a linear subspace of canonical parameters. More specifically, letting I(H) denote the subset of indices associated with cliques in H, the set of canonical parameters corresponding to distributions structured according to H is given by

E(H) := {θ ∈ Θ | θα = 0 ∀ α ∈ I \ I(H)}.   (7.46)

We consider some examples to illustrate.

Example 7.10 Tractable Subgraphs
The simplest instance of a tractable subgraph is the completely disconnected graph H0 = (V, ∅) (see fig. 7.7b). Permissible parameters belong to the subspace E(H0) := {θ ∈ Θ | θst = 0 ∀ (s, t) ∈ E}, where θst refers to the collection of canonical parameters associated with edge (s, t). The associated distributions are


of the product form p(x; θ) = ∏_{s∈V} p(xs; θs), where θs refers to the collection of canonical parameters associated with vertex s. To obtain a more structured approximation, one could choose a spanning tree T = (V, E(T)), as illustrated in fig. 7.7c. In this case, we are free to choose the canonical parameters corresponding to vertices and edges in T, but we must set to zero any canonical parameters corresponding to edges not in the tree. Accordingly, the subspace of tree-structured distributions is given by E(T) = {θ | θst = 0 ∀ (s, t) ∉ E(T)}. ♦

For a given subgraph H, consider the set of all possible mean parameters that are realizable by tractable distributions:

Mtract(G; H) := {μ ∈ R^d | μ = Eθ[φ(x)] for some θ ∈ E(H)}.   (7.47)

The notation Mtract(G; H) indicates that mean parameters in this set arise from taking expectations of sufficient statistics associated with the graph G, but that they must be realizable by a tractable distribution — i.e., one that respects the structure of H. See example 7.11 for an explicit illustration of this set when the tractable subgraph H is the fully disconnected graph. Since any μ that arises from a tractable distribution is certainly a valid mean parameter, the inclusion Mtract(G; H) ⊆ M(G) always holds. In this sense, Mtract is an inner approximation to the set M of realizable mean parameters.

Optimization and Lower Bounds

We now have the necessary ingredients to develop the mean field approach to approximate inference. Let p(x; θ) denote the target distribution that we are interested in approximating. The basis of the mean field method is the following fact: any valid mean parameter specifies a lower bound on the cumulant generating function. Indeed, as an immediate consequence of the variational principle 7.37, we have

A(θ) ≥ ⟨θ, μ⟩ − A∗(μ)   (7.48)

for any μ ∈ M. This inequality can also be established by applying Jensen's inequality (Jordan et al., 1999). Since the dual function A∗ typically lacks an explicit form, it is not possible, at least in general, to compute the lower bound (eq. 7.48). The mean field approach circumvents this difficulty by restricting the choice of μ to the tractable subset Mtract(G; H), for which the dual function has an explicit form A∗H. As long as μ belongs to Mtract(G; H), the lower bound 7.48 is computable. Of course, for a nontrivial class of tractable distributions, there are many such bounds. The goal of the mean field method is the natural one: find the best approximation μMF, as measured in terms of the tightness of the bound. This optimal approximation is specified as the solution of the optimization problem

sup_{μ∈Mtract(G;H)} { ⟨μ, θ⟩ − A∗H(μ) },   (7.49)

Figure 7.7 Graphical illustration of the mean field approximation. (a) Original graph is a 7 × 7 grid. (b) Fully disconnected graph, corresponding to a naive mean field approximation. (c) A more structured approximation based on a spanning tree.

which is a relaxation of the exact variational principle 7.37. The optimal value specifies a lower bound on A(θ), and it is (by definition) the best one that can be obtained by using a distribution from the tractable class. An important alternative interpretation of the mean field approach is in terms of minimizing the Kullback-Leibler (KL) divergence between the approximating (tractable) distribution and the target distribution. Given two densities p and q, the KL divergence is given by

D(p ∥ q) = ∫_{X^n} p(x) log [p(x)/q(x)] ν(dx).   (7.50)

To see the link to our derivation of mean field, consider, for a given mean parameter μ ∈ Mtract(G; H), the difference between the log partition function A(θ) and the quantity ⟨μ, θ⟩ − A∗H(μ):

D(μ ∥ θ) = A(θ) + A∗H(μ) − ⟨μ, θ⟩.

A bit of algebra shows that this difference is equal to the KL divergence 7.50 with q = p(x; θ) and p = p(x; μ) (i.e., the exponential family member with mean parameter μ). Therefore, solving the mean field variational problem 7.49 is equivalent to minimizing the KL divergence subject to the constraint that μ belongs to the tractable set of mean parameters, or equivalently that p is a tractable distribution.

7.5.2 Naive Mean Field Updates

The naive mean ﬁeld (MF) approach corresponds to choosing a fully factorized or product distribution in order to approximate the original distribution. The naive mean ﬁeld updates are a particular set of recursions for ﬁnding a stationary point of the resulting optimization problem.


Example 7.11
As an illustration, we derive the naive mean field updates for the Ising model, which is a special case of the multinomial MRF defined in example 7.4. It involves binary variables, so that Xs = {0, 1} for all vertices s ∈ V. Moreover, the canonical parameters are of the form θs(xs) = θs xs and θst(xs, xt) = θst xs xt for real numbers θs and θst. Consequently, the exponential representation of the Ising model has the form

p(x; θ) ∝ exp{ Σ_{s∈V} θs xs + Σ_{(s,t)∈E} θst xs xt }.

Letting H0 denote the fully disconnected graph (i.e., without any edges), the tractable set Mtract(G; H0) consists of all mean parameters {μs, μst} that arise from a product distribution. Explicitly, in this binary case, we have Mtract(G; H0) := {(μs, μst) | 0 ≤ μs ≤ 1, μst = μs μt}. Moreover, the negative entropy of a product distribution over binary random variables decomposes into the sum A∗H0(μ) = Σ_{s∈V} [μs log μs + (1 − μs) log(1 − μs)]. Accordingly, the associated naive mean field problem takes the form

max_{μ∈Mtract(G;H0)} { ⟨μ, θ⟩ − A∗H0(μ) }.

In this particular case, it is convenient to eliminate μst by replacing it with the product μs μt. Doing so leads to a reduced form of the problem:

max_{{μs}∈[0,1]^n} { Σ_{s∈V} θs μs + Σ_{(s,t)∈E} θst μs μt − Σ_{s∈V} [μs log μs + (1 − μs) log(1 − μs)] }.   (7.51)

Let F denote the function of μ within curly braces in equation 7.51. It can be seen that the function F is strictly concave in a given coordinate μs when all the other coordinates are held fixed. Moreover, it is straightforward to show that the maximum over μs, with μt for t ≠ s held fixed, is attained in the interior (0, 1), and can be found by taking the gradient and setting it equal to zero. Doing so yields the following update for μs:

μs ← σ( θs + Σ_{t∈N(s)} θst μt ),   (7.52)

where σ(z) := [1 + exp(−z)]⁻¹ is the logistic function. Applying equation 7.52 iteratively to each node in succession amounts to performing coordinate ascent on the objective function of the mean field variational problem 7.51. Thus, we have derived the update equation presented earlier in equation 7.10. ♦

Similarly, it is straightforward to apply the naive mean field approximation to other types of graphical models, as we illustrate for a multivariate Gaussian.
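Returning to example 7.11, the coordinate ascent of equation 7.52 takes only a few lines. The sketch below (plain Python; the 2 × 2 grid and the parameter values are illustrative) runs the updates and then checks the lower-bound property 7.48 against A(θ) computed by brute-force enumeration:

```python
import math
from itertools import product

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative Ising model on a 2x2 grid (nodes 0..3).
theta_s = {0: 0.5, 1: -0.3, 2: 0.2, 3: 0.1}
theta_st = {(0, 1): 0.8, (0, 2): -0.4, (1, 3): 0.6, (2, 3): 0.3}
nbrs = {s: [] for s in theta_s}
for (s, t), w in theta_st.items():
    nbrs[s].append((t, w))
    nbrs[t].append((s, w))

# Naive mean field coordinate ascent, eq. 7.52.
mu = {s: 0.5 for s in theta_s}
for _ in range(200):
    for s in theta_s:
        mu[s] = sigma(theta_s[s] + sum(w * mu[t] for t, w in nbrs[s]))

# Mean field objective <theta, mu> - A*_{H0}(mu): a lower bound on A(theta).
def H_bin(m):
    return -(m * math.log(m) + (1.0 - m) * math.log(1.0 - m))

bound = (sum(theta_s[s] * mu[s] for s in theta_s)
         + sum(w * mu[s] * mu[t] for (s, t), w in theta_st.items())
         + sum(H_bin(mu[s]) for s in theta_s))

# Exact A(theta) by enumerating the 2^4 configurations.
A = math.log(sum(
    math.exp(sum(theta_s[s] * x[s] for s in theta_s)
             + sum(w * x[s] * x[t] for (s, t), w in theta_st.items()))
    for x in product((0, 1), repeat=4)))

assert bound <= A + 1e-9
```

On a model this small the enumeration is exact, so the assertion verifies that the mean field objective never exceeds the true cumulant generating function.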


Example 7.12 Gaussian Mean Field
The mean parameters for a multivariate Gaussian are of the form μs = E[xs], μss = E[xs²], and μst = E[xs xt] for s ≠ t. Using only Gaussians in product form, the set of tractable mean parameters takes the form Mtract(G; H0) = {μ ∈ R^d | μst = μs μt ∀ s ≠ t, μss − μs² > 0}. As with naive mean field on the Ising model, the constraints μst = μs μt for s ≠ t can be imposed directly, thereby leaving only the inequality μss − μs² > 0 for each node. The negative entropy of a Gaussian in product form can be written as A∗Gauss(μ) = −Σ_{s=1}^n (1/2) log(μss − μs²) − (n/2) log 2πe. Combining A∗Gauss with the constraints leads to the naive MF problem for a multivariate Gaussian:

sup_{(μs, μss) | μss − μs² > 0} { ⟨U(θ), W(μ)⟩ + Σ_{s=1}^n (1/2) log(μss − μs²) + (n/2) log 2πe },

where the matrices U(θ) and W(μ) are defined in equation 7.40. Here it should be understood that any terms μst, s ≠ t, contained in W(μ) are replaced with the product μs μt. Taking derivatives with respect to μss and μs and rearranging yields the stationary conditions 1/[2(μss − μs²)] = −θss and μs/[2(μss − μs²)] = θs + Σ_{t∈N(s)} θst μt. Since θss < 0, we can combine both equations into the update μs ← −(1/θss)[θs + Σ_{t∈N(s)} θst μt]. In fact, the resulting algorithm is equivalent to the Gauss-Jacobi method for solving the normal equations, and so is guaranteed to converge under suitable conditions (Demmel, 1997), in which case the algorithm computes the correct mean vector [μ1 … μn]. ♦
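The combined update of example 7.12 is a Jacobi-style fixed-point iteration. A minimal sketch (assuming numpy; the parameter values are illustrative, with θss = −1 and weak couplings so that the iteration is a contraction):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Illustrative canonical parameters: theta_ss on the diagonal (negative),
# theta_st off the diagonal, theta_s the linear terms.
J = 0.1 * rng.standard_normal((n, n))
J = (J + J.T) / 2.0
np.fill_diagonal(J, 0.0)
theta_ss = -np.ones(n)             # theta_ss < 0 for each node
theta_s = rng.standard_normal(n)

# Gauss-Jacobi iteration: mu_s <- -(theta_s + sum_t theta_st mu_t) / theta_ss
mu = np.zeros(n)
for _ in range(500):
    mu = -(theta_s + J @ mu) / theta_ss

# The fixed point solves (diag(theta_ss) + J) mu = -theta_s,
# i.e., the normal equations for the mean.
mu_direct = np.linalg.solve(np.diag(theta_ss) + J, -theta_s)
assert np.allclose(mu, mu_direct)
```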

7.5.3 Structured Mean Field and Other Extensions

Of course, the essential principles underlying the mean field approach are not limited to fully factorized distributions. More generally, one can consider classes of tractable distributions that incorporate additional structure. This structured mean field approach was first proposed by Saul and Jordan (1996), and further developed by various researchers. In this section, we discuss only one particular example in order to illustrate the basic idea, and refer the interested reader elsewhere (Wainwright and Jordan, 2003b; Wiegerinck, 2000) for further details.

Example 7.13 Structured Mean Field for Factorial Hidden Markov Models
The factorial hidden Markov model, as described in Ghahramani and Jordan (1997), has the form shown in fig. 7.8a. It consists of a set of M Markov chains (M = 3 in this diagram), which share at each time a common observation (shaded nodes). Such models are useful, for example, in modeling the joint dependencies between speech and video signals over time. Although the separate chains are independent a priori, the common observation induces an effective coupling between all nodes at each time (a coupling which is


captured by the moralization process mentioned earlier). Thus, an equivalent model is shown in fig. 7.8b, where the dotted ellipses represent the induced coupling of each observation.

Figure 7.8 Structured mean field approximation for a factorial HMM. (a) Original model consists of a set of hidden Markov models (defined on chains), coupled at each time by a common observation. (b) An equivalent model, where the ellipses represent interactions among all nodes at a fixed time, induced by the common observation. (c) Approximating distribution formed by a product of chain-structured models. Here μα and μδ are the sets of mean parameters associated with the indicated vertex and edge, respectively.

A natural choice of approximating distribution in this case is based on the subgraph H consisting of the decoupled set of M chains, as illustrated in ﬁg. 7.8c. The decoupled nature of the approximation yields valuable savings on the computational side. In particular, it can be shown (Saul and Jordan, 1996; Wainwright and Jordan, 2003b) that all intermediate quantities necessary for implementing the structured mean ﬁeld updates can be calculated by applying the forward-backward algorithm (i.e., the sum-product updates as an exact method) to each chain separately. ♦ In addition to structured mean ﬁeld, there are various other extensions to naive mean ﬁeld, which we mention only in passing here. A large class of techniques, including linear response theory and the TAP method (Kappen and Rodriguez, 1998; Opper and Saad, 2001; Plefka, 1982), seek to improve the mean ﬁeld approximation by introducing higher-order correction terms. Although the lower bound on the log partition function is not usually preserved by these higher-order methods, Leisink and Kappen (2001) demonstrated how to generate tighter lower bounds based on higher-order expansions. 7.5.4

Geometric View of Mean Field

An important fact about the mean ﬁeld approach is that the variational problem (eq. 7.49) may be nonconvex, so that there may be local minima, and the mean ﬁeld updates can have multiple solutions.


One way to understand this nonconvexity is in terms of the set of tractable mean parameters: under fairly mild conditions, it can be shown (Wainwright and Jordan, 2003b) that the set Mtract (G; H) is nonconvex. Figure 7.9 provides a geometric illustration for the case of a multinomial MRF, for which the set M is a marginal polytope.

Figure 7.9 The set Mtract(G; H) of mean parameters that arise from tractable distributions is a nonconvex inner bound on M(G). Illustrated here is the multinomial case, where M(G) ≡ MARG(G) is a polytope. The circles correspond to mean parameters that arise from delta distributions with all their mass on a single configuration, and belong to both M(G) and Mtract(G; H).
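The nonconvexity of Mtract can be exhibited with just two binary variables. The sketch below (plain Python, illustrative) averages two product-form mean parameter vectors and checks that the midpoint is still a valid marginal (it lies in M) but violates μ12 = μ1 μ2 (so it lies outside Mtract):

```python
# Two mean parameter vectors (mu1, mu2, mu12) from product distributions
# over a pair of binary variables: mu12 = mu1 * mu2 for each.
p = (0.2, 0.2, 0.2 * 0.2)
q = (0.8, 0.8, 0.8 * 0.8)

# Their midpoint, a convex combination of the two.
m = tuple((a + b) / 2.0 for a, b in zip(p, q))   # approximately (0.5, 0.5, 0.34)

# The midpoint is globally realizable: the joint below has exactly these
# singleton and pairwise marginals and is a valid distribution.
p11 = m[2]
p10 = m[0] - p11
p01 = m[1] - p11
p00 = 1.0 - p11 - p10 - p01
assert all(v >= 0.0 for v in (p00, p01, p10, p11))
assert abs(p00 + p01 + p10 + p11 - 1.0) < 1e-12

# But the midpoint is not in Mtract: mu12 != mu1 * mu2 (0.34 vs 0.25).
assert abs(m[2] - m[0] * m[1]) > 0.08
```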

A practical consequence of this nonconvexity is that the mean field updates are often sensitive to the initial conditions. Moreover, the mean field method can exhibit spontaneous symmetry breaking, wherein the mean field approximation is asymmetric even though the original problem is perfectly symmetric; see Jaakkola (2001) for an illustration of this phenomenon. Despite this nonconvexity, the mean field approximation becomes exact for certain types of models as the number of nodes n grows to infinity (Baxter, 1982).

7.5.5 Parameter Estimation and Variational Expectation Maximization

Mean field methods also play an important role in the problem of parameter estimation, in which the goal is to estimate model parameters on the basis of partial observations. The expectation-maximization (EM) algorithm (Dempster et al., 1977) provides a general approach to maximum likelihood parameter estimation in the case in which some subset of variables is observed whereas others are unobserved. Although the EM algorithm is often presented as an alternation between an expectation step (E step) and a maximization step (M step), it is also possible to take a variational perspective on EM, and view both steps as maximization steps (Csiszár and Tusnády, 1984; Neal and Hinton, 1999). More concretely, in the exponential family setting, the E step reduces to the computation of expected sufficient statistics — i.e., mean parameters. As we have seen, the variational framework provides a general class of methods for computing approximations of mean parameters. This observation suggests a general class of variational EM algorithms, in which the approximation provided by a variational inference algorithm is substituted for the mean parameters in the E step. In general, as a consequence of making such a substitution, one loses the guarantees that are associated with the EM algorithm. In the specific case of mean field algorithms, however, a convergence guarantee is retained: in particular, the algorithm will converge to a stationary point of a lower bound on the likelihood function (Wainwright and Jordan, 2003b).

7.6 The Bethe Entropy Approximation and the Sum-Product Algorithm

In this section, we turn to another important message-passing algorithm for approximate inference, known either as belief propagation or the sum-product algorithm. In section 7.4.2, we described the use of the sum-product algorithm for trees, in which context it is guaranteed to converge and perform exact inference. When the same message-passing updates are applied to graphs with cycles, in contrast, there are no such guarantees; nonetheless, this "loopy" form of the sum-product algorithm is widely used to compute approximate marginals in various signal-processing applications, including phase unwrapping (Frey et al., 2001), low-level vision (Freeman et al., 2000), and channel decoding (Richardson and Urbanke, 2001). The main idea of this section is the connection between the sum-product updates and the Bethe variational principle. The presentation given here differs from the original work of Yedidia et al. (2001) in that we formulate the problem purely in terms of mean parameters and marginal polytopes. This perspective highlights a key point: mean field and sum-product, though similar as message-passing algorithms, are fundamentally different at the variational level. In particular, whereas the essence of mean field is to restrict optimization to a limited class of distributions for which the negative entropy and mean parameters can be characterized exactly, the sum-product algorithm is based on enlarging the constraint set and approximating the entropy function. The standard Bethe approximation applies to an undirected graphical model with potential functions involving at most pairs of variables, which we refer to as a pairwise Markov random field.
In principle, by selectively introducing auxiliary variables, any undirected graphical model can be converted into an equivalent pairwise form to which the Bethe approximation can be applied; see Weiss and Freeman (2000) for a detailed description of this procedure. Moreover, although the Bethe approximation can be developed more generally, we also limit our discussion to a multinomial MRF, as discussed earlier in examples 7.4 and 7.9. We also make use of the local marginal functions μs (xs ) and μst (xs , xt ), as deﬁned in equation 7.32. As discussed in Example 7.9, the set M associated with a multinomial MRF is the marginal polytope MARG(G). Recall that there are two components to the general variational principle 7.37: the set of realizable mean parameters (given by a marginal polytope in this case),


and the dual function A∗. Developing an approximation to the general principle requires approximations to both of these components, which we discuss in turn in the following sections.

7.6.1 Bethe Entropy Approximation

From equation 7.35, recall that the dual function A∗ corresponds to the maximum entropy distribution consistent with a given set of mean parameters; as such, it typically lacks a closed-form expression. An important exception to this general rule is the case of a tree-structured distribution: as discussed in section 7.4.2, the function A∗ for a tree-structured distribution has a closed-form expression that is straightforward to compute; see, in particular, equation 7.44. Of course, the entropy of a distribution defined by a graph with cycles will not, in general, decompose additively like that of a tree. Nonetheless, one can imagine using the decomposition in equation 7.44 as an approximation to the entropy. Doing so yields an expression known as the Bethe approximation to the entropy on a graph with cycles:

HBethe(μ) := Σ_{s∈V} Hs(μs) − Σ_{(s,t)∈E} Ist(μst).   (7.53)

To be clear, the quantity HBethe(μ) is an approximation to the negative dual function −A∗(μ). Moreover, our development in section 7.4.2 shows that this approximation is exact when the graph is tree-structured. An alternative form of the Bethe entropy approximation can be derived by writing the mutual information in terms of entropies as Ist(μst) = Hs(μs) + Ht(μt) − Hst(μst). In particular, expanding the mutual information terms in this way, and then collecting all the single-node entropy terms, yields

HBethe(μ) = Σ_{s∈V} (1 − ds) Hs(μs) + Σ_{(s,t)∈E} Hst(μst),

where ds denotes the number of neighbors of node s. This representation is the form of the Bethe entropy introduced by Yedidia et al. (2001); however, the form given in equation 7.53 turns out to be more convenient for our purposes.

7.6.2 Tree-Based Outer Bound

Note that the Bethe entropy approximation HBethe is certainly well defined for any μ ∈ MARG(G). However, as discussed earlier, characterizing this polytope of realizable marginals is a very challenging problem. Accordingly, a natural approach is to specify a subset of necessary constraints, which leads to an outer bound on MARG(G). Let τs(xs) and τst(xs, xt) be a set of candidate marginal distributions. In section 7.4.2, we considered the following constraint set:

LOCAL(G) = { τ ≥ 0 | Σ_{xs} τs(xs) = 1, Σ_{xs} τst(xs, xt) = τt(xt) }.   (7.54)

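In code, these local consistency constraints are easy to check; the following is a sketch (array layout and names are my own), assuming each τ_st is stored with rows indexed by x_s and columns by x_t:

```python
import numpy as np

def in_local(node_marg, edge_marg, tol=1e-9):
    """Check membership in LOCAL(G) (eq. 7.54): nonnegativity,
    normalization of each node marginal, and edgewise marginalization."""
    for s, tau_s in node_marg.items():
        if (tau_s < -tol).any() or abs(tau_s.sum() - 1.0) > tol:
            return False
    for (s, t), tau_st in edge_marg.items():
        if (tau_st < -tol).any():
            return False
        # rows are indexed by x_s, columns by x_t
        if not np.allclose(tau_st.sum(axis=1), node_marg[s], atol=tol):
            return False
        if not np.allclose(tau_st.sum(axis=0), node_marg[t], atol=tol):
            return False
    return True

tau_s = {1: np.array([0.5, 0.5]), 2: np.array([0.5, 0.5])}
tau_st = {(1, 2): np.array([[0.4, 0.1], [0.1, 0.4]])}
print(in_local(tau_s, tau_st))   # True: these constraints are satisfied
```

Note that passing this check establishes only local consistency; as the discussion below shows, it does not guarantee global realizability.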

A Variational Principle for Graphical Models

Although LOCAL(G) is an exact description of the marginal polytope for a tree-structured graph, it is only an outer bound for graphs with cycles. (We demonstrate this fact more concretely in example 7.14.) For this reason, our change in notation (i.e., from μ to τ) is quite deliberate, with the goal of emphasizing that members τ of LOCAL(G) need not be realizable. We refer to members of LOCAL(G) as pseudomarginals (these are sometimes referred to as beliefs).

Example 7.14 Pseudomarginals
We illustrate using a binary random vector on the simplest possible graph for which LOCAL(G) is not an exact description of MARG(G), namely, a single cycle with three nodes. Consider candidate marginal distributions {τ_s, τ_{st}} of the form

    \tau_s := \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}, \qquad \tau_{st} := \begin{bmatrix} \beta_{st} & 0.5 - \beta_{st} \\ 0.5 - \beta_{st} & \beta_{st} \end{bmatrix},   (7.55)

where β_{st} ∈ [0, 0.5] is a parameter to be specified independently for each edge (s, t). It is straightforward to verify that {τ_s, τ_{st}} belong to LOCAL(G) for any choice of β_{st} ∈ [0, 0.5]. First, consider the setting β_{st} = 0.4 for all edges (s, t), as illustrated in fig. 7.10a. It is not difficult to show that the resulting marginals thus defined are realizable; in fact, they can be obtained from the distribution that places probability 0.35 on each of the configurations [0 0 0] and [1 1 1], and probability 0.05 on each of the remaining six configurations. Now suppose that we perturb one of the pairwise marginals, say τ_{13}, by setting β_{13} = 0.1. The resulting problem is illustrated in fig. 7.10b. Observe that there are now strong (positive) dependencies between the pairs of variables (x_1, x_2) and (x_2, x_3): both pairs are quite likely to agree (with probability 0.8). In contrast, the pair (x_1, x_3) can only share the

Figure 7.10 (a), (b): Illustration of the marginal polytope for a single-cycle graph on three nodes. (a) Setting β_{st} = 0.4 for all three edges gives a globally consistent set of marginals. (b) With β_{13} perturbed to 0.1, the marginals (though locally consistent) are no longer globally so. (c) For a more general graph, an idealized illustration of the tree-based constraint set LOCAL(G) as an outer bound on the marginal polytope MARG(G).


same value relatively infrequently (with probability 0.2). This arrangement should provoke some doubt. Indeed, it can be shown that τ ∉ MARG(G), either by attempting but failing to construct a distribution that realizes τ, or, much more directly, by using the idea of semidefinite constraints (see example 7.15). ♦

More generally, figure 7.10c provides an idealized illustration of the constraint set LOCAL(G) and its relation to the exact marginal polytope MARG(G). Observe that the set LOCAL(G) is another polytope that is a convex outer approximation to MARG(G). It is worthwhile contrasting it with the nonconvex inner approximation used by a mean field approximation, as illustrated in fig. 7.9.

7.6.3 Bethe Variational Problem and Sum-Product

Note that the Bethe entropy is also well defined for any pseudomarginal in LOCAL(G). Therefore, it is valid to consider a constrained optimization problem over the set LOCAL(G) in which the cost function involves the Bethe entropy approximation H_Bethe. Indeed, doing so leads to the so-called Bethe variational problem:

    \max_{\tau \in \mathrm{LOCAL}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \Big\}.   (7.56)

Although ostensibly similar to a (structured) mean field approach, the Bethe variational problem (BVP) is fundamentally different in a number of ways. First, as discussed in section 7.5.1, a mean field method is based on an exact representation of the entropy, albeit over a limited class of distributions. In contrast, with the exception of tree-structured graphs, the Bethe entropy is a bona fide approximation to the entropy. For instance, it is not difficult to see that it can be negative, which of course can never happen for an exact entropy. Second, the mean field approach entails optimizing over an inner bound on the marginal polytope, which ensures that any mean field solution is always globally consistent with respect to at least one distribution, and that it yields a lower bound on the log partition function. In contrast, since LOCAL(G) is a strict outer bound on the set of realizable marginals MARG(G), the optimizing pseudomarginals τ* of the BVP may not be globally consistent with any distribution.

7.6.4 Solving the Bethe Variational Problem

Having formulated the Bethe variational problem, we now consider iterative methods for solving it. Observe that the set LOCAL(G) is a polytope deﬁned by O(n + |E|) constraints. A natural approach to solving the BVP, then, is to attach Lagrange multipliers to these constraints, and ﬁnd stationary points of the Lagrangian. A remarkable fact, established by Yedidia et al. (2001), is that the sum-product updates of equation 7.6 can be rederived as a method for trying to ﬁnd such Lagrangian stationary points.


A bit more formally, for each x_s ∈ X_s, let λ_{st}(x_s) be a Lagrange multiplier associated with the constraint C_{ts}(x_s) = 0, where C_{ts}(x_s) := τ_s(x_s) - \sum_{x_t} τ_{st}(x_s, x_t). Our approach is to consider the following partial Lagrangian corresponding to the Bethe variational problem 7.56:

    L(\tau; \lambda) := \langle \theta, \tau \rangle + H_{\mathrm{Bethe}}(\tau) + \sum_{(s,t) \in E} \Big[ \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s) + \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t) \Big].

The key insight of Yedidia et al. (2001) is that any fixed point of the sum-product updates specifies a pair (τ*, λ*) such that

    \nabla_\tau L(\tau^*; \lambda^*) = 0, \qquad \nabla_\lambda L(\tau^*; \lambda^*) = 0.   (7.57)

In particular, the Lagrange multipliers can be used to specify messages of the form M_{ts}(x_s) = exp(λ_{ts}(x_s)). After taking derivatives of the Lagrangian and equating them to zero, some algebra then yields the familiar message-update rule:

    M_{ts}(x_s) = \kappa \sum_{x_t} \exp\big\{ \theta_{st}(x_s, x_t) + \theta_t(x_t) \big\} \prod_{u \in N(t) \setminus s} M_{ut}(x_t).   (7.58)
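The update rule 7.58 is simple to implement; here is a minimal numpy sketch (data layout and names are my own) that runs the parallel updates and then forms the node pseudomarginals from the incoming messages:

```python
import numpy as np

def sum_product(theta_s, theta_st, edges, n_states=2, n_iters=50):
    """Parallel sum-product updates of eq. 7.58. M[(t, s)] is the message
    from node t to node s; each message is normalized to sum to one
    (this fixes the constant kappa)."""
    M = {}
    for (s, t) in edges:
        M[(s, t)] = np.ones(n_states)
        M[(t, s)] = np.ones(n_states)
    nbrs = {}
    for (s, t) in edges:
        nbrs.setdefault(s, []).append(t)
        nbrs.setdefault(t, []).append(s)
    for _ in range(n_iters):
        new = {}
        for (t, s) in M:
            # theta_st is stored with rows indexed by x_s, columns by x_t
            pair = theta_st[(s, t)] if (s, t) in theta_st else theta_st[(t, s)].T
            msg = np.zeros(n_states)
            for xs in range(n_states):
                for xt in range(n_states):
                    prod = 1.0
                    for u in nbrs[t]:
                        if u != s:
                            prod *= M[(u, t)][xt]
                    msg[xs] += np.exp(pair[xs, xt] + theta_s[t][xt]) * prod
            new[(t, s)] = msg / msg.sum()
        M = new
    # node pseudomarginals: tau_s(x_s) proportional to exp(theta_s(x_s))
    # times the product of all incoming messages
    tau = {}
    for s in nbrs:
        b = np.exp(theta_s[s]) * np.prod([M[(u, s)] for u in nbrs[s]], axis=0)
        tau[s] = b / b.sum()
    return tau
```

On a tree the fixed point recovers the exact marginals; on a graph with cycles the same updates yield the (pseudo)marginals discussed above, and, as noted below, convergence is not guaranteed.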

We refer the reader to Yedidia et al. (2001) or Wainwright and Jordan (2003b) for further details of this derivation. By construction, any fixed point M* of these updates specifies a pair (τ*, λ*) that satisfies the stationarity conditions given in equation 7.57 (see note 3). This variational formulation of the sum-product updates, namely, as an algorithm for solving a constrained optimization problem, has a number of important consequences. First of all, it can be used to guarantee the existence of sum-product fixed points. Observe that the cost function in the Bethe variational problem 7.56 is continuous and bounded above, and the constraint set LOCAL(G) is nonempty and compact; therefore, at least some (possibly local) maximum is attained. Moreover, since the constraints are linear, there will always be a set of Lagrange multipliers associated with any local maximum (Bertsekas, 1995b). For any optimum in the relative interior of LOCAL(G), these Lagrange multipliers can be used to construct a fixed point of the sum-product updates. For graphs with cycles, this Lagrangian formulation provides no guarantees on the convergence of the sum-product updates; indeed, whether or not the algorithm converges depends on both the potential strengths and the topology of the graph. Several researchers (Heskes et al.; Welling and Teh, 2001; Yuille, 2002) have proposed alternatives to sum-product that are guaranteed to converge, albeit at the price of increased computational cost. It should also be noted that, with the exception of trees and other special cases (McEliece and Yildirim, 2002; Pakzad and Anantharam, 2002), the BVP is usually a nonconvex problem, in that H_Bethe fails to be concave. As a consequence, there may be multiple local optima to the BVP, and there are no guarantees that sum-product (or other iterative algorithms) will find a global optimum.


As illustrated in fig. 7.10c, the constraint set LOCAL(G) of the Bethe variational problem is a strict outer bound on the marginal polytope MARG(G). Since the exact marginals of p(x; θ) must always lie in the marginal polytope, a natural question is whether solutions to the Bethe variational problem ever fall into the region LOCAL(G) \ MARG(G). There turns out to be a straightforward answer to this question, stemming from an alternative reparameterization-based characterization of sum-product fixed points (Wainwright et al., 2003b). One consequence of this characterization is that for any vector τ of pseudomarginals in the interior of LOCAL(G), it is possible to specify a distribution for which τ is a sum-product fixed point. As a particular example, it is possible to construct a distribution p(x; θ) such that the pseudomarginal τ discussed in example 7.14 is a fixed point of the sum-product updates.

7.6.5 Extensions Based on Clustering and Hypertrees

From our development in the previous section, it is clear that there are two distinct components to the Bethe variational principle: (1) the entropy approximation H_Bethe, and (2) the approximation LOCAL(G) to the set of realizable marginal parameters. In principle, the BVP could be strengthened by improving either one, or both, of these components. One natural generalization of the BVP, first proposed by Yedidia et al. (2002) and further explored by various researchers (Heskes et al.; McEliece and Yildirim, 2002; Minka, 2001), is based on working with clusters of variables. The approximations in the Bethe approach are based on trees, which are special cases of junction trees based on cliques of size two. A natural strategy, then, is to strengthen the approximations by exploiting more complex junction trees, also known as hypertrees. Our description of this procedure is very brief, but further details can be found in various sources (Wainwright and Jordan, 2003b; Yedidia et al., 2002). Recall that the essential ingredients in the Bethe variational principle are local (pseudo)marginal distributions on nodes and edges (i.e., pairs of nodes). These distributions, subject to edgewise marginalization conditions, are used to specify the Bethe entropy approximation. One way to improve the Bethe approach, which is based on pairs of nodes, is to build entropy approximations and impose marginalization constraints on larger clusters of nodes. To illustrate, suppose that the original graph is simply the 3 × 3 grid shown in fig. 7.11a. A particular grouping of the nodes, known as Kikuchi four-plaque clustering in statistical physics (Yedidia et al., 2002), is illustrated in fig. 7.11b. This operation creates four new "supernodes" or clusters, each consisting of four nodes from the original graph. These clusters, as well as their overlaps (which turn out to be critical to track for certain technical reasons; see Yedidia et al., 2002), are illustrated in fig. 7.11c.
Given a clustering of this type, we now consider a set of marginal distributions τ_h, where h ranges over the clusters. As with the singleton τ_s and pairwise τ_{st} that define the Bethe approximation, we require that these higher-order cluster marginals be suitably normalized (i.e., \sum_{x_h} \tau_h(x_h) = 1), and be consistent with one

another whenever they overlap. More precisely, for any pair of clusters g ⊆ h, the following marginalization condition must hold:

    \sum_{\{x_h' \,\mid\, x_g' = x_g\}} \tau_h(x_h') = \tau_g(x_g).

Imposing these normalization and marginalization conditions leads to a higher-order analog of the constraint set LOCAL(G) previously defined in equation 7.54. In analogy to the Bethe entropy approximation, we can also consider a hypertree-based approximation to the entropy. There are certain technical aspects to specifying such entropy approximations, in that it turns out to be critical to ensure that the local entropies are weighted with certain "overcounting" numbers (Wainwright and Jordan, 2003b; Yedidia et al., 2002). Without going into these details here, the outcome is another relaxed variational principle, which can be understood as a higher-level analog of the Bethe variational principle.

Figure 7.11 (a) Ordinary 3 × 3 grid. (b) Clustering of the vertices into groups of four, known as Kikuchi four-plaque clustering. (c) Poset diagram of the clusters as well as their overlaps. Pseudomarginals on these subsets must satisfy certain local consistency conditions, and are used to define a higher-order entropy approximation.
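The cluster-consistency condition above can be checked numerically; the following sketch (array layout and names are my own) verifies that a cluster pseudomarginal on the plaque {1, 2, 4, 5} of fig. 7.11 marginalizes down to the pseudomarginal on its overlap {2, 5}:

```python
import numpy as np

def cluster_consistent(tau_h, h_vars, tau_g, g_vars, tol=1e-9):
    """Check that tau_h on cluster h marginalizes to tau_g on the
    subset g: summing tau_h over the variables in h \\ g must
    reproduce tau_g."""
    sum_axes = tuple(i for i, v in enumerate(h_vars) if v not in g_vars)
    marg = tau_h.sum(axis=sum_axes)
    # after summing, the remaining axes follow h_vars restricted to g
    remaining = [v for v in h_vars if v in g_vars]
    perm = [remaining.index(v) for v in g_vars]
    return np.allclose(np.transpose(marg, perm), tau_g, atol=tol)

# A random joint pseudomarginal on the binary cluster {1, 2, 4, 5}:
rng = np.random.default_rng(0)
tau_h = rng.random((2, 2, 2, 2))
tau_h /= tau_h.sum()
tau_g = tau_h.sum(axis=(0, 2))   # marginal onto variables 2 and 5
print(cluster_consistent(tau_h, [1, 2, 4, 5], tau_g, [2, 5]))   # True
```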

7.7 From the Exact Principle to New Approximations

The preceding sections have illustrated how a variety of known methods, both exact and approximate, can be understood in a unified manner on the basis of the general variational principle given in equation 7.37. In this final section, we turn to a brief discussion of several new approximate methods that also emerge from this same variational principle. Given space constraints, our discussion in this chapter is necessarily brief, but we refer the reader to Wainwright and Jordan (2003a,b) and Wainwright et al. (2002, 2003a) for further details.

7.7.1 Exploiting Semidefinite Constraints for Approximate Inference

As discussed in section 7.5, one key component in any relaxation of the exact variational principle is an approximation of the set M of realizable mean parameters. Recall that for graphical models that involve discrete random variables, we refer to


this set as a marginal polytope. Since any polytope is specified by a finite collection of halfspace constraints (see fig. 7.6), one very natural way in which to generate an outer approximation is by including only a subset of these halfspace constraints. Indeed, as we have seen in section 7.6, it is precisely this route that the Bethe approximation and its clustering-based extensions follow. However, such polyhedral relaxations are not the only way in which to generate outer approximations to marginal polytopes. Recognizing that elements of the marginal polytope are essentially moments leads very naturally to the idea of a semidefinite relaxation. Indeed, the use of semidefinite constraints for characterizing moments has a very rich history, both with classical work (Karlin and Studden, 1966) on scalar random variables, and more recent work (Lasserre, 2001; Parrilo, 2003) on the multivariate case.

Semidefinite Outer Bounds on Marginal Polytopes
We use the case of a multinomial MRF defined by a graph G = (V, E), as discussed in example 7.4, in order to illustrate the use of semidefinite constraints. Although the basic idea is quite generally applicable (Wainwright and Jordan, 2003b), herein we restrict ourselves to binary variables (i.e., X_s = {0, 1}) so as to simplify the exposition. Recall that the sufficient statistics in a binary MRF take the form of certain indicator functions, as defined in equation 7.20. In fact, this representation is overcomplete (in that there are linear dependencies among the indicator functions); in the binary case, it suffices to consider only the sufficient statistics x_s = I_1(x_s) and x_s x_t = I_{11}(x_s, x_t). Our goal, then, is to characterize the set of all first- and second-order moments, defined by μ_s = E[x_s] and μ_{st} = E[x_s x_t] respectively, that arise from taking expectations with respect to a distribution with its support restricted to {0, 1}^n.

Rather than focusing on just the pairs μ_{st} for edges (s, t) ∈ E, it is convenient to consider the full collection of pairwise moments {μ_{st} | s, t ∈ V}. Suppose that we are given a vector μ ∈ R^d (where d = n + \binom{n}{2}), and wish to assess whether or not it is a globally realizable moment vector (i.e., whether there exists some distribution p(x) such that μ_s = \sum_x p(x)\, x_s and μ_{st} = \sum_x p(x)\, x_s x_t). In order to derive a necessary condition, we suppose that such a distribution p exists, and then consider the following (n + 1) × (n + 1) moment matrix:

    E_p\!\left[ \begin{pmatrix} 1 \\ x \end{pmatrix} \begin{pmatrix} 1 \\ x \end{pmatrix}^{T} \right]
    = \begin{bmatrix}
    1 & \mu_1 & \mu_2 & \cdots & \mu_{n-1} & \mu_n \\
    \mu_1 & \mu_1 & \mu_{12} & \cdots & \cdots & \mu_{1n} \\
    \mu_2 & \mu_{21} & \mu_2 & \cdots & \cdots & \mu_{2n} \\
    \vdots & \vdots & \vdots & \ddots & & \vdots \\
    \mu_{n-1} & \vdots & \vdots & & \ddots & \mu_{(n-1),n} \\
    \mu_n & \mu_{n1} & \mu_{n2} & \cdots & \mu_{(n-1),n} & \mu_n
    \end{bmatrix},   (7.59)

which we denote by M_1[μ]. Note that in calculating the form of this moment matrix, we have made use of the relation μ_s = μ_{ss}, which holds because x_s = x_s^2 for any binary-valued quantity.


We now observe that any such moment matrix is necessarily positive semidefinite, which we denote by M_1[μ] ⪰ 0. (This positive semidefiniteness can be verified as follows: letting y := (1, x), then for any vector a ∈ R^{n+1}, we have a^T M_1[μ] a = a^T E[y y^T] a = E[(a^T y)^2], which is certainly nonnegative.) Therefore, we conclude that the semidefinite constraint set SDEF_1 := {μ ∈ R^d | M_1[μ] ⪰ 0} is an outer bound on the exact marginal polytope.

Example 7.15
To illustrate the use of the outer bound SDEF_1, recall the pseudomarginal vector τ that we constructed in example 7.14 for the single cycle on three nodes. In terms of our reduced representation (involving only expectations of the singletons x_s and pairwise functions x_s x_t), this pseudomarginal can be written as follows:

    \tau_s = 0.5 \text{ for } s = 1, 2, 3, \qquad \tau_{12} = \tau_{23} = 0.4, \qquad \tau_{13} = 0.1.

Suppose that we now construct the matrix M_1 for this trial set of mean parameters; it takes the following form:

    M_1[\tau] = \begin{bmatrix}
    1 & 0.5 & 0.5 & 0.5 \\
    0.5 & 0.5 & 0.4 & 0.1 \\
    0.5 & 0.4 & 0.5 & 0.4 \\
    0.5 & 0.1 & 0.4 & 0.5
    \end{bmatrix}.

A simple calculation shows that it is not positive semidefinite, so that τ ∉ SDEF_1. Since SDEF_1 is an outer bound on the marginal polytope, this reasoning shows, in a very quick and direct manner, that τ is not a globally valid moment vector. In fact, the semidefinite constraint set SDEF_1 can be viewed as the first in a sequence of progressively tighter relaxations of the marginal polytope.

Log-Determinant Relaxation
We now show how to use such semidefinite constraints in approximate inference. Our approach is based on combining the first-order semidefinite outer bound SDEF_1 with a Gaussian-based entropy approximation. The end result is a log-determinant problem that represents another relaxation of the exact variational principle (Wainwright and Jordan, 2003a). In contrast to the Bethe and Kikuchi approaches, this relaxation is convex (and hence has a unique optimum), and moreover provides an upper bound on the cumulant generating function.

Our starting point is the familiar interpretation of the Gaussian as the maximum entropy distribution subject to covariance constraints (Cover and Thomas, 1991). In particular, given a continuous random vector \tilde{x}, its differential entropy h(\tilde{x}) is always upper bounded by the entropy of a Gaussian with matched covariance, or in analytical terms

    h(\tilde{x}) \le \frac{1}{2} \log \det \operatorname{cov}(\tilde{x}) + \frac{n}{2} \log(2\pi e),   (7.60)

where cov(\tilde{x}) is the covariance matrix of \tilde{x}. The upper bound 7.60 is not directly


applicable to a random vector taking values in a discrete space (since differential entropy in this case diverges to minus infinity). However, a straightforward discretization argument shows that for any discrete random vector x ∈ {0, 1}^n, its (ordinary) discrete entropy can be upper bounded in terms of the matrix M_1[μ] of mean parameters as

    H(x) = -A^*(\mu) \le \frac{1}{2} \log \det \Big( M_1[\mu] + \frac{1}{12} \operatorname{blkdiag}[0, I_n] \Big) + \frac{n}{2} \log(2\pi e),   (7.61)

where blkdiag[0, I_n] is an (n + 1) × (n + 1) block-diagonal matrix with a 1 × 1 zero block and an n × n identity block. Finally, putting all the pieces together leads to the following result (Wainwright and Jordan, 2003a): the cumulant generating function A(θ) is upper bounded by the solution of the following log-determinant optimization problem:

    A(\theta) \le \max_{\tau \in \mathrm{SDEF}_1} \Big\{ \langle \theta, \tau \rangle + \frac{1}{2} \log \det \Big( M_1(\tau) + \frac{1}{12} \operatorname{blkdiag}[0, I_n] \Big) \Big\} + \frac{n}{2} \log(2\pi e).   (7.62)

Note that the constraint τ ∈ SDEF_1 ensures that M_1(τ) ⪰ 0, and hence a fortiori that M_1(τ) + (1/12) blkdiag[0, I_n] is positive definite. Moreover, an important fact is that the optimization problem in equation 7.62 is a determinant maximization problem, for which efficient interior point methods have been developed (Vandenberghe et al., 1998). Just as the Bethe variational principle (eq. 7.56) is a tree-based approximation, the log-determinant relaxation (eq. 7.62) is a Gaussian-based approximation. In particular, it is worthwhile comparing the structure of the log-determinant relaxation (eq. 7.62) to the exact variational principle for a multivariate Gaussian, as described in section 7.4.1. In contrast to the Bethe variational principle, in which all of the constraints defining the relaxation are local, this new principle (eq. 7.62) imposes some quite global constraints on the mean parameters.
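The discrete entropy bound in equation 7.61 can be spot-checked numerically; the following sketch (my own, not from the chapter) verifies it for the uniform distribution on {0, 1}^3:

```python
import numpy as np
from itertools import product

n = 3
configs = np.array(list(product([0, 1], repeat=n)), dtype=float)
p = np.full(len(configs), 1.0 / len(configs))   # uniform distribution

# Exact discrete entropy: H(x) = n * log 2 for the uniform case
H = -np.sum(p * np.log(p))

# Build the (n+1) x (n+1) moment matrix M1[mu] of eq. 7.59
y = np.hstack([np.ones((len(configs), 1)), configs])   # y = (1, x)
M1 = (y[:, :, None] * y[:, None, :] * p[:, None, None]).sum(axis=0)

# Upper bound of eq. 7.61
D = np.diag([0.0] + [1.0 / 12] * n)              # (1/12) blkdiag[0, I_n]
bound = 0.5 * np.linalg.slogdet(M1 + D)[1] + 0.5 * n * np.log(2 * np.pi * np.e)
print(H <= bound)   # True: the log-determinant expression upper-bounds H(x)
```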
Empirically, these global constraints are important for strongly coupled problems, in which the performance of the log-determinant relaxation appears much more robust than that of the sum-product algorithm (Wainwright and Jordan, 2003a). In summary, starting from the exact variational principle (eq. 7.37), we have derived a new relaxation whose properties are rather different from those of the Bethe and Kikuchi variational principles.

7.7.2 Relaxations for Computing Modes

Recall from our introductory comments in section 7.1.2 that, in addition to the problem of computing expectations and likelihoods, it is also frequently of interest to compute the mode of a distribution. This section is devoted to a brief discussion of mode computation, and more concretely how the exact variational principle (eq. 7.37), as well as relaxations thereof, again turns out to play an important role.


Zero-Temperature Limits
In order to understand the role of the exact variational principle 7.37 in computing modes, consider a multinomial MRF of the form p(x; θ), as discussed in example 7.4. Of interest to us is the one-parameter family of distributions {p(x; βθ) | β > 0}, where β is a real number to be varied. At one extreme, if β = 0, then there is no coupling, and the distribution is simply uniform over all possible configurations. The other extreme, as β → +∞, is more interesting; in this limit, the distribution concentrates all of its mass on the configuration (or subset of configurations) that are modes of the distribution. Taking the limit β → +∞ is known as the zero-temperature limit, since the parameter β is typically viewed as an inverse temperature in statistical physics. This argument suggests that there should be a link between computing modes and the limiting behavior of the marginalization problem as β → +∞. In order to develop this idea a bit more formally, we begin by observing that the exact variational principle 7.37 holds for the distribution p(x; βθ) for any value of β ≥ 0. It can be shown (Wainwright and Jordan, 2003b) that if we take a suitably scaled limit of this exact variational principle as β → +∞, then we recover the following variational principle for computing modes:

    \max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{\mu \in \mathrm{MARG}(G)} \langle \theta, \mu \rangle.   (7.63)
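The zero-temperature concentration behind equation 7.63 is easy to observe on a toy model; in the sketch below (the parameter values are my own, chosen only for illustration), the mean parameters of p(x; βθ) converge to the vertex of the marginal polytope corresponding to the mode as β grows:

```python
import numpy as np
from itertools import product

# A small binary pairwise MRF with hand-picked parameters (illustrative only)
n = 4
theta_s = np.array([1.0, -0.5, 0.2, -0.1])
theta_st = np.zeros((n, n))
theta_st[0, 1] = 2.0     # attractive coupling between x0 and x1
theta_st[2, 3] = -1.5    # repulsive coupling between x2 and x3

def score(x):
    """<theta, phi(x)> for this binary pairwise MRF."""
    x = np.asarray(x, dtype=float)
    return theta_s @ x + x @ theta_st @ x

configs = np.array(list(product([0, 1], repeat=n)), dtype=float)
mode = configs[np.argmax([score(x) for x in configs])]

# As beta grows, the mean parameters mu = E[x] under p(x; beta * theta)
# concentrate on phi(mode), a vertex of the marginal polytope.
for beta in [1.0, 10.0, 50.0]:
    logw = beta * np.array([score(x) for x in configs])
    p = np.exp(logw - logw.max())
    p /= p.sum()
    mu = p @ configs
    print(beta, np.round(mu, 3))
print(mode)   # the configuration the means converge to
```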

Since the log probability log p(x; θ) is equal to ⟨θ, φ(x)⟩ (up to an additive constant), the left-hand side is simply the problem of computing the mode of the distribution p(x; θ). On the right-hand side, we simply have a linear program, since the constraint set MARG(G) is a polytope, and the cost function ⟨θ, μ⟩ is linear in μ (with θ fixed). This equivalence means that, at least in principle, we can compute a mode of the distribution by solving a linear program (LP) over the marginal polytope. The geometric interpretation is also clear: as illustrated in fig. 7.6, vertices of the marginal polytope are in one-to-one correspondence with configurations x. Since any LP achieves its optimum at a vertex (Bertsimas and Tsitsiklis, 1997), solving the LP is equivalent to finding the mode.

Linear Programming and Tree-Reweighted Max-Product
Of course, the LP-based reformulation in equation 7.63 is not practically useful, for precisely the same reasons as before: it is extremely challenging to characterize the marginal polytope MARG(G) for a general graph. Many computationally intractable optimization problems (e.g., MAX-CUT) can be reformulated as LPs over the marginal polytope, as in equation 7.63, which underscores the inherent complexity of characterizing marginal polytopes. Nonetheless, this variational formulation motivates the idea of forming relaxations using outer bounds on the marginal polytope. For various classes of problems in combinatorial optimization, both linear programming and semidefinite relaxations of this flavor have been studied extensively. Here we briefly describe an LP relaxation that is very natural given our development of the Bethe variational principle in section 7.6. In particular, we consider using the local constraint set LOCAL(G), as defined in equation 7.54, as


an outer bound on the marginal polytope MARG(G). Doing so leads to the following LP relaxation for the problem of computing the mode of a multinomial MRF:

    \max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{\mu \in \mathrm{MARG}(G)} \langle \theta, \mu \rangle \;\le\; \max_{\tau \in \mathrm{LOCAL}(G)} \langle \theta, \tau \rangle.   (7.64)

Since the relaxed constraint set LOCAL(G)—like the original set MARG(G)—is a polytope, the relaxation on the right-hand side of equation 7.64 is a linear program. Consequently, the optimum of the relaxed problem must be attained at a vertex (possibly more than one) of the polytope LOCAL(G).
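One can exhibit such a gap explicitly for the three-node cycle of example 7.14. In the sketch below (my own construction), the potentials penalize agreement on every edge of a triangle: any integral configuration must pay for at least one agreeing edge, yet the fractional pseudomarginals with β_st = 0 are locally consistent and incur no penalty at all, so the optimum over LOCAL(G) strictly exceeds the true mode value:

```python
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2), (0, 2)]
# theta_st(x_s, x_t) = -1 if x_s == x_t, else 0: agreement is penalized
# on every edge of the triangle.
theta = {e: np.array([[-1.0, 0.0], [0.0, -1.0]]) for e in edges}

# Best integral configuration: on an odd cycle, at least one edge must agree.
exact = max(sum(theta[(s, t)][x[s], x[t]] for (s, t) in edges)
            for x in product([0, 1], repeat=3))

# A fractional point of LOCAL(G): uniform node marginals with all pairwise
# mass on disagreements (beta_st = 0 in the notation of example 7.14).
tau_s = {s: np.array([0.5, 0.5]) for s in range(3)}
tau_st = {e: np.array([[0.0, 0.5], [0.5, 0.0]]) for e in edges}

# Local consistency: rows and columns of each tau_st sum to the node marginals.
for (s, t), m in tau_st.items():
    assert np.allclose(m.sum(axis=1), tau_s[s])
    assert np.allclose(m.sum(axis=0), tau_s[t])

relaxed = sum(np.sum(theta[e] * tau_st[e]) for e in edges)
print(exact, relaxed)   # -1.0 0.0: the relaxed value strictly exceeds the mode
```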

Figure 7.12 The constraint set LOCAL(G) is an outer bound on the exact marginal polytope. Its vertex set includes all the vertices of MARG(G), which are in one-to-one correspondence with optimal solutions of the integer program. It also includes additional fractional vertices, which are not vertices of MARG(G).

We say that a vertex of LOCAL(G) is integral if all of its components are zero or one, and fractional otherwise. The distinction between fractional and integral vertices is crucial, because it determines whether or not the LP relaxation 7.64 specified by LOCAL(G) is tight. In particular, there are only two possible outcomes to solving the relaxation:

1. The optimum is attained at a vertex of MARG(G), in which case the upper bound in equation 7.64 is tight, and a mode can be obtained.

2. The optimum is attained only at one or more fractional vertices of LOCAL(G), which lie strictly outside MARG(G). In this case, the upper bound of equation 7.64 is loose, and the relaxation does not output the optimal configuration.

Figure 7.12 illustrates both of these possibilities. The vector θ1 corresponds to case 1, in which the optimum is attained at a vertex of MARG(G). The vector θ2 represents a less fortunate setting, in which the optimum is attained only at a fractional vertex of LOCAL(G). In simple cases, one can explicitly demonstrate a fractional vertex of the polytope LOCAL(G). Given the link between the sum-product algorithm and the Bethe variational principle, it would be natural to conjecture that the max-product algorithm can be derived as an algorithm for solving the LP relaxation 7.64. For trees (in which


case the LP 7.64 is exact), this conjecture is true: more precisely, it can be shown (Wainwright et al., 2003a) that the max-product algorithm (or the Viterbi algorithm) is an iterative method for solving the dual problem of the LP 7.64. However, this statement is false for graphs with cycles, since it is straightforward to construct problems (on graphs with cycles) for which the max-product algorithm will output a nonoptimal configuration. Consequently, the max-product algorithm does not specify solutions to the dual problem, since any LP relaxation will output either a configuration with a guarantee of correctness, or a fractional vertex. However, Wainwright et al. (2003a) derive a tree-reweighted analog of the max-product algorithm, which does have provable connections to dual optimal solutions of the tree-based relaxation 7.64.

7.8 Conclusion

A fundamental problem that arises in applications of graphical models, whether in signal processing, machine learning, bioinformatics, communication theory, or other fields, is that of computing likelihoods, marginal probabilities, and other expectations. We have presented a variational characterization of the problem of computing likelihoods and expectations in general exponential-family graphical models. Our characterization focuses attention on both the constraint set and the objective function. In particular, for exponential-family graphical models, the constraint set M is a convex subset of a finite-dimensional space, consisting of all realizable mean parameters. The objective function is the sum of a linear function and an entropy function. The latter is a concave function, and thus the overall problem, that of maximizing the objective function over M, is a convex problem. In this chapter, we discussed how the junction tree algorithm and other exact inference algorithms can be understood as particular methods for solving this convex optimization problem. In addition, we showed that a variety of approximate inference algorithms, including loopy belief propagation, general cluster variational methods, and mean field methods, can be understood as methods for solving particular relaxations of the general variational principle. More concretely, we saw that belief propagation involves an outer approximation of M, whereas mean field methods involve an inner approximation of M. In addition, this variational principle suggests a number of new inference algorithms, as we briefly discussed. It is worth noting certain limitations inherent to the variational framework as presented in this chapter. In particular, we have not discussed curved exponential families, but instead limited our treatment to regular families.
Curved exponential families are useful in the context of directed graphical models, and further research is required to develop a general variational treatment of such models. Similarly, we have dealt exclusively with exponential family models, and not treated nonparametric models. One approach to exploiting variational ideas for nonparametric models


is through exponential family approximations of nonparametric distributions; for example, Blei and Jordan (2004) have presented inference methods for Dirichlet process mixtures that are based on the variational framework presented here.

Notes

1. The Gaussian case is an important exception to this statement.
2. That a graph is triangulated means that every cycle of length four or longer has a chord.
3. Some care is required in dealing with the boundary conditions τ_s(x_s) ≥ 0 and τ_{st}(x_s, x_t) ≥ 0; see Yedidia et al. (2001) for further discussion.

8 Modeling Large Dynamical Systems with Dynamical Consistent Neural Networks

Hans-Georg Zimmermann, Ralph Grothmann, Anton Maximilian Schäfer, and Christoph Tietz

Recurrent neural networks are typically considered to be relatively simple architectures that come along with complicated learning algorithms. Most researchers focus on improving these algorithms. Our approach is different: rather than focusing on learning and optimization algorithms, we concentrate on the network architecture. Unfolding in time is a well-known example of this modeling philosophy. Here, a temporal algorithm is transferred into an architectural framework such that the learning can be done using an extension of standard error backpropagation. As we will show, many difficulties in the modeling of dynamical systems can be solved with neural network architectures. We exemplify architectural solutions for the modeling of open systems and the problem of unknown external influences. Another research area is the modeling of high-dimensional systems with large neural networks. Instead of modeling, e.g., a financial market as a small set of time series, we try to integrate the information from several markets into one integrated model. Standard neural networks tend to overfit, like other statistical learning systems. We will introduce a new recurrent neural network architecture in which overfitting and the associated loss of generalization ability are not a major problem. In this context we will point to different sources of uncertainty that have to be handled when dealing with recurrent neural networks. Furthermore, we will show that sparseness of the network's transition matrix is not only important to dampen overfitting but also provides new features such as an optimal memory design.

8.1  Introduction

Recurrent neural networks (RNNs) allow the identification of dynamical systems in the form of high-dimensional, nonlinear state space models. They offer an explicit modeling of time and memory and allow us, in principle, to model any type of dynamical system (Elman, 1990; Haykin, 1994; Kolen and Kremer, 2001; Medsker and Jain, 1999). The basic concept is as old as the theory of artificial neural networks itself; unfolding in time of neural networks and related modifications of the backpropagation algorithm can already be found in Werbos (1974) and Rumelhart et al. (1986). Different types of learning algorithms are summarized by Pearlmutter (1995). Nevertheless, over the last 15 years most time series problems have been approached with feedforward neural networks. The appeal of modeling time and memory in recurrent networks is opposed by the apparently better numerical tractability of a pattern-recognition approach as represented by feedforward neural networks. Still, some researchers did enhance the theory of recurrent neural networks. Recent developments are summarized in the books of Haykin (1994), Kolen and Kremer (2001), Soofi and Cao (2002), and Medsker and Jain (1999).

Our approach differs from the outlined research directions in a significant but, at first sight, nonobvious way. Instead of focusing on algorithms, we put network architectures in the foreground. We show that a network architecture automatically implies an adjoint solution algorithm for the parameter identification problem. This correspondence between architecture and equations holds for simple as well as complex network architectures. The underlying assumption is that the associated parameter optimization problem is solved by error backpropagation through time, i.e., a shared-weights extension of the standard error backpropagation algorithm.

In technical and economic applications virtually all systems of interest are open dynamical systems (see section 8.2). This means that the dynamics of the system is determined partly by an autonomous development and partly by external drivers from the system environment. The measured data always reflect a superposition of both parts.
If we are interested in forecasting the development of the system, extracting the autonomous subsystem is the most relevant task, since it is the only part of the open system that can be predicted (see section 8.2.3). A related question is the sequence length of the unfolding in time which is necessary to approximate the recurrent system (see section 8.2.2).

The outlined concepts are only applicable if we have a perfectly specified open dynamical system, where all external drivers are known. Unfortunately, this assumption is virtually never fulfilled in real-world applications. Even if we knew all the external system drivers, it would be questionable whether an appropriate amount of training data would be available. As a consequence, the task of identifying the open system is misspecified right from the beginning. To address this problem, we introduce error correction neural networks (ECNNs) (Zimmermann et al., 2002b) (see section 8.2.4).

Another weakness of our modeling framework is the implicit assumption that we only have to analyze a small number of time series. This assumption, too, rarely holds in real-world applications. For instance, in economics we face coherent markets and not a single interest or foreign exchange rate. A market or a complex technical plant is intrinsically high-dimensional. Now the major problem is that all our neural networks tend to overfit if we increase the model dimensionality in order to approach the true high-dimensional system dynamics. We therefore present recurrent network


architectures which work even for very large state spaces (see section 8.3). These networks also combine different operations of small neural networks (e.g., the processing of input information) into one shared state transition matrix. Our experiments indicate that this stabilizes the model behavior to a large extent (see section 8.3.1). If one iterates an open system into the future, the standard assumption is that the system environment remains constant. As this is not true for most real-world applications, we introduce dynamical consistent recurrent neural networks, which also try to forecast the external influences (see section 8.3.2). We then combine the concepts of large networks and dynamical consistency with error correction. We show that ECNNs can be extended in a slightly different way than basic recurrent networks (see section 8.3.3). We also demonstrate that some types of dynamical systems can more easily be analyzed with a (dynamical consistent) recurrent network, while others are more appropriate for ECNNs. Our intention is to merge the different aspects of the competing network architectures within a single recurrent neural network. We call it DCNN, for dynamical consistent neural network. We found that DCNNs allow us to model even small deviations in the dynamics without losing the generalization abilities of the model. We point out that the networks presented so far create state trajectories of the dynamics that are close to the observed ones, whereas the DCNN evolves exactly on the observed trajectories (see section 8.3.4). Finally, we introduce a DCNN architecture for partially known observables to generalize our models from a differentiation between past and future to the (time-independent) availability of information (see section 8.3.5).

The identification and forecasting of dynamical systems has to cope with a number of uncertainties in the underlying data as well as in the development of the dynamics (see section 8.4).
Cleaning noise is a technique which allows the model itself, within the training process, to correct corrupted or noisy data (see section 8.4.1). Working with finite unfolding in time brings up the problem of initializing the internal state at the first time step. We present different approaches to desensitize the model's behavior to the initial state and simultaneously improve the generalization abilities (see section 8.4.2). To stabilize the network against uncertainties in the environment's future development, we further apply noise to the inputs in the future part of the network (see section 8.4.3).

Working with (high-dimensional) recurrent networks raises the question of how a desired network function can be supported by a certain structure of the transition matrix (see section 8.5). As we will point out, sparseness alone is not sufficient to optimize the network functions regarding conservation and superposition of information. Only with an inflation of the internal dimension of the recurrent neural network can we implement an optimal balance between memory and computation effects (see sections 8.5.1 and 8.5.2). In this context we work out that sparseness of the transition matrix is actually a necessary condition for large neural networks (see section 8.5.3). Furthermore, we analyze the information flow in sparse networks and present an architectural solution which speeds up the distribution of information (see section 8.5.4).


Finally, section 8.6 summarizes our contributions to the research ﬁeld of recurrent neural networks.

8.2  Recurrent Neural Networks (RNN)

Figure 8.1 illustrates a dynamical system (Zimmermann and Neuneier, 2001, p. 321).

Figure 8.1  Identification of a dynamical system using a discrete time description: input u, hidden states s, and output y.

The dynamical system (ﬁg. 8.1) can be described for discrete time grids as a set of equations (eq. 8.1), consisting of a state transition and an output equation (Haykin, 1994; Kolen, 2001):

s_{t+1} = f(s_t, u_t)    (state transition)
y_t = g(s_t)             (output equation)    (8.1)

The state transition is a mapping from the present internal hidden state of the system s_t and the influence of the external inputs u_t to the new state s_{t+1}. The output equation computes the observable output y_t. The system can be viewed as a partially observable autoregressive dynamic state transition s_t → s_{t+1} which is also driven by external forces u_t. Without the external inputs the system is called an autonomous system (Haykin, 1994; Mandic and Chambers, 2001). However, in reality most systems are driven by a superposition of an autonomous development and external influences.

The task of identifying the dynamical system of equation 8.1 can be stated as the problem of finding (parameterized) functions f and g such that a distance measurement (eq. 8.2) between the observed data y_t^d and the computed data y_t of the model is minimal:¹

∑_{t=1}^{T} (y_t − y_t^d)² → min_{f,g}    (8.2)
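As a concrete illustration of equations 8.1 and 8.2, the following sketch simulates a small open dynamical system and evaluates the quadratic identification criterion. The particular choices of f, g, and the data are purely illustrative assumptions, not the systems studied in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s, u):
    # state transition s_{t+1} = f(s_t, u_t): autonomous part plus external driver
    return np.tanh(0.9 * s + 0.5 * u)

def g(s):
    # output equation y_t = g(s_t): here we simply observe the scalar state
    return s

T = 50
u = rng.normal(size=T)                 # external inputs u_t
s = 0.0
y = np.empty(T)
for t in range(T):
    y[t] = g(s)                        # observable output
    s = f(s, u[t])                     # advance the hidden state

y_d = y + 0.01 * rng.normal(size=T)    # noisy observed data y_t^d
error = np.sum((y - y_d) ** 2)         # identification criterion of eq. 8.2
```

Identification would now mean adjusting parameterized f and g so that the error becomes minimal, which is exactly the task the recurrent network of section 8.2.1 solves.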

If we assume that the state transition does not depend on s_t, i.e., y_t = g(s_t) = g(f(u_{t−1})), we are back in the framework of feedforward neural networks (Neuneier and Zimmermann, 1998). However, the inclusion of the internal hidden dynamics makes the modeling task much harder, because it allows varying intertemporal dependencies. Theoretically, in the recurrent framework an event s_{t+1} is explained by a superposition of external inputs u_t, u_{t−1}, ... from all previous time steps (Haykin, 1994; Mandic and Chambers, 2001).

8.2.1  Representing Dynamic Systems by Recurrent Neural Networks

The identification task of equations 8.1 and 8.2 can be easily modeled by a recurrent neural network (Haykin, 1994; Zimmermann and Neuneier, 2001):

s_{t+1} = tanh(A s_t + c + B u_t)    (state transition)
y_t = C s_t                          (output equation)    (8.3)

where A, B, and C are weight matrices of appropriate dimensions and c is a bias, which handles offsets in the input variables u_t. Note that the output equation y_t = C s_t is implemented as a linear function. It is straightforward to show that this is not a functional restriction by using an augmented inner state vector (Zimmermann and Neuneier, 2001, pp. 322–323). By specifying the functions f and g as a neural network with weight matrices A, B, and C and a bias vector c, we have transformed the system identification task of equation 8.2 into a parameter optimization problem:

∑_{t=1}^{T} (y_t − y_t^d)² → min_{A,B,C,c}    (8.4)
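The state space model of equation 8.3 translates directly into a forward pass. The following minimal sketch uses illustrative dimensions and random weights rather than trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, input_dim, output_dim, T = 4, 2, 1, 20

A = 0.5 * rng.normal(size=(state_dim, state_dim))  # state transition matrix
B = rng.normal(size=(state_dim, input_dim))        # input matrix
C = rng.normal(size=(output_dim, state_dim))       # linear output matrix
c = np.zeros(state_dim)                            # bias for input offsets

def rnn_forward(u_seq):
    """Iterate s_{t+1} = tanh(A s_t + c + B u_t), y_t = C s_t over a sequence."""
    s = np.zeros(state_dim)
    outputs = []
    for u in u_seq:
        s = np.tanh(A @ s + c + B @ u)  # state transition
        outputs.append(C @ s)           # linear output equation
    return np.array(outputs)

y = rnn_forward(rng.normal(size=(T, input_dim)))
```

Training then means minimizing the quadratic error of equation 8.4 with respect to A, B, C, and c.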

As Hornik et al. (1992) proved for feedforward neural networks, it can be shown that recurrent neural networks (eq. 8.3) are universal approximators, as they can approximate any arbitrary dynamical system (eq. 8.1) with a continuous output function g.

8.2.2  Finite Unfolding in Time

In this section we discuss an architectural representation of recurrent neural networks that enables us to solve the parameter optimization problem of equation 8.4 by an extended version of standard backpropagation (Haykin, 1994; Rumelhart et al., 1986). Figure 8.2 unfolds the network of equation 8.3 (fig. 8.2, left) over time using shared weight matrices A, B, and C (fig. 8.2, right). Shared weights share the same memory for storing their weights, i.e., the weight values are the same at each time step of the unfolding and for every pattern t ∈ {1, ..., T} (Haykin, 1994; Rumelhart et al., 1986). This guarantees that we have the same dynamics in every time step.

Figure 8.2  Finite unfolding using shared weight matrices A, B, and C: the recurrent network of equation 8.3 (left) and its unfolded counterpart (right).

We approximate the recurrence of the system with a finite unfolding which truncates after a certain number of time steps m ∈ N. The important question to solve is the determination of the correct amount of past information needed to predict y_{t+1}. Since the outputs are explained by more and more external information, the error of the outputs decreases with each additional time step from left to right until a minimum error is achieved. This saturation level indicates the maximum number of time steps m which contribute relevant information for modeling the present time state. A more detailed description is given in Zimmermann and Neuneier (2001).

We train the unfolded recurrent neural network shown in fig. 8.2 (right) with error backpropagation through time, which is a shared-weights extension of standard backpropagation (Haykin, 1994; Rumelhart et al., 1986). Error backpropagation is an efficient way of calculating the partial derivatives of the network error function. Thus, all parts of the network are provided with error information.

In contrast to typical feedforward neural networks, RNNs are able to explicitly model memory. This allows the identification of intertemporal dependencies. Furthermore, recurrent networks contain fewer free parameters. In a feedforward neural network, an expansion of the delay structure automatically increases the number of weights (left panel of fig. 8.3). In the recurrent formulation, the shared matrices A, B, and C are reused when more delayed input information from the past is needed (right panel of fig. 8.3). Additionally, if weights are shared more often, more gradient information is available for learning. As a consequence, potential overfitting is not as dangerous in recurrent as in feedforward networks. Due to the inclusion of temporal structure in the network architecture, our approach is applicable to tasks where only a small training set is available (Zimmermann and Neuneier, 2001, p. 325).
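A compact sketch of backpropagation through time with shared weights follows: the gradients of the shared A, B, C, and c are accumulated over all unfolded time steps before a single update is applied. All sizes, data, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, p, T = 3, 2, 1, 10                 # state, input, output dims, unfolding
A = 0.3 * rng.normal(size=(n, n))
B = rng.normal(size=(n, k))
C = rng.normal(size=(p, n))
c = np.zeros(n)
u = rng.normal(size=(T, k))              # external inputs
y_d = rng.normal(size=(T, p))            # observed targets

def forward(u_seq):
    states, ys = [np.zeros(n)], []
    for t in range(T):
        states.append(np.tanh(A @ states[-1] + c + B @ u_seq[t]))
        ys.append(C @ states[-1])
    return states, np.array(ys)

states, y = forward(u)
loss = np.sum((y - y_d) ** 2)

# backward pass: one gradient contribution per unfolded copy of each matrix
dA, dB = np.zeros_like(A), np.zeros_like(B)
dC, dc = np.zeros_like(C), np.zeros_like(c)
carry = np.zeros(n)                      # dL/ds flowing backward in time
for t in reversed(range(T)):
    dy = 2.0 * (y[t] - y_d[t])
    dC += np.outer(dy, states[t + 1])
    da = (C.T @ dy + carry) * (1.0 - states[t + 1] ** 2)  # through tanh
    dA += np.outer(da, states[t])
    dB += np.outer(da, u[t])
    dc += da
    carry = A.T @ da                     # propagate to the previous time step

lr = 1e-3                                # one small step on the shared weights
A -= lr * dA
B -= lr * dB
C -= lr * dC
c -= lr * dc
new_loss = np.sum((forward(u)[1] - y_d) ** 2)
```

Because the same matrices appear in every unfolded time step, each of them receives T gradient contributions per pattern, which is the extra gradient information mentioned above.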


Figure 8.3  An additional time step leads in the feedforward framework (left), with y_{t+1} = V tanh(W u + c), to a higher dimension of the input vector u, whereas the number of free parameters remains constant in recurrent networks (right), due to the use of shared weights.

8.2.3  Overshooting

An obvious generalization of the network in fig. 8.2 is the extension of the autonomous recurrence (matrix A) into the future direction t + 2, t + 3, ... (see fig. 8.4) (Zimmermann and Neuneier, 2001, pp. 326–327). If this so-called overshooting leads to good predictions, we get a whole sequence of forecasts as an output. This is especially interesting for decision support systems. The number of autonomous iterations into the future, which we denote by n ∈ N, most often depends on the required forecast horizon of the application. Note that overshooting does not add new parameters, since the shared weight matrices A and C are reused.
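Under illustrative weights and data, the mechanics of overshooting can be sketched as follows: after the input-driven past part, the shared matrices A and C are iterated autonomously to emit the forecast sequence y_{t+1}, ..., y_{t+n}.

```python
import numpy as np

rng = np.random.default_rng(5)
n_state, n_in, n_out = 3, 2, 1
m, n_over = 8, 4                          # past unfolding, overshooting steps
A = 0.4 * rng.normal(size=(n_state, n_state))
B = rng.normal(size=(n_state, n_in))
C = rng.normal(size=(n_out, n_state))
c = np.zeros(n_state)

s = np.zeros(n_state)
for u in rng.normal(size=(m, n_in)):      # input-driven past part
    s = np.tanh(A @ s + c + B @ u)

forecasts = []
for _ in range(n_over):                   # autonomous future part: no new u
    s = np.tanh(A @ s + c)                # shared A is reused, no new weights
    forecasts.append(C @ s)               # shared C emits y_{t+1}, ..., y_{t+n}
forecasts = np.array(forecasts)
```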

Figure 8.4  Overshooting extends the autonomous part of the dynamics.

The most important property of the overshooting network (fig. 8.4) is the concatenation of an input-driven system and an autonomous system. One may argue that the unfolding-in-time network (fig. 8.2) already consists of recurrent functions, and that this recurrent structure has the same modeling characteristics as the overshooting network. This is definitely not true, because the learning algorithm leads to different models for each of the architectures. Backpropagation learning usually tries to model the relationship between the most recent inputs and the output, because the fastest adaptation takes place along the shortest path between input and output. Thus, learning mainly focuses on u_t. Only later in the training process may learning also extract useful information from input vectors u_τ (t − m ≤ τ < t) which are more distant from the output. As a consequence, the unfolding-in-time network (fig. 8.2, right) tries to rely as much as possible on the part of the dynamics which is driven by the most recent inputs u_t, ..., u_{t−k} with k < m. In contrast, the overshooting network (fig. 8.4) forces the learning, through the additional future outputs y_{t+2}, ..., y_{t+n}, to focus on modeling an internal autonomous dynamics (Zimmermann and Neuneier, 2001). In summary, overshooting generates additional valuable forecast information about the analyzed dynamical system and stabilizes learning.

8.2.4  Error Correction Neural Networks (ECNN)

If we have a complete description of all external influences, recurrent neural networks (eq. 8.3) allow us to identify the intertemporal relationships (Haykin, 1994). Unfortunately, our knowledge about the external forces is typically incomplete and our observations might be noisy. Under such conditions, learning with finite data sets leads to the construction of incorrect causalities due to learning by heart (overfitting). The generalization properties of such a model are questionable (Neuneier and Zimmermann, 1998).

If we are unable to identify the underlying system dynamics due to insufficient input information or unknown influences, we can refer to the actual model error y_t − y_t^d, which can be interpreted as an indicator of where our model is misleading. Handling this error information as an additional input, we extend equation 8.1, obtaining:

s_{t+1} = f(s_t, u_t, y_t − y_t^d)
y_t = g(s_t)    (8.5)

The state transition s_{t+1} is a mapping from the previous state s_t, the external influences u_t, and a comparison between the model output y_t and the observed data y_t^d. If the model error (y_t − y_t^d) is zero, we have a perfect description of the dynamics. However, due to unknown external influences or noise, our knowledge about the dynamics is often incomplete. Under such conditions, the model error (y_t − y_t^d) quantifies the model's misfit and serves as an indicator of short-term effects or external shocks (Zimmermann et al., 2002b).

Using weight matrices A, B, C, and D of appropriate dimensions corresponding to s_t, u_t, and (y_t − y_t^d), and a bias c, a neural network approach to 8.5 can be written as

s_{t+1} = tanh(A s_t + c + B u_t + D tanh(C s_t − y_t^d))
y_t = C s_t    (8.6)

In 8.6 the output y_t is computed by C s_t and compared to the observation y_t^d. The matrix D adjusts a possible difference in dimension between the error correction term and s_t. The system identification is now a parameter optimization task over appropriately sized weight matrices A, B, C, D, and the bias c (Zimmermann et al., 2002b):

∑_{t=1}^{T} (y_t − y_t^d)² → min_{A,B,C,D,c}    (8.7)

We solve the system identification task of 8.7 by finite unfolding in time using shared weights (see section 8.2.2). Figure 8.5 depicts the resulting neural network solution of 8.6.

Figure 8.5  Error correction neural network (ECNN) using unfolding in time and overshooting. Note that −Id is the fixed negative of an appropriately sized identity matrix, while z_τ with t − m ≤ τ ≤ t are output clusters with target values of zero in order to optimize the error correction mechanism.

The ECNN (eq. 8.6) is best understood by analyzing the dependencies of s_t, u_t, z_t = C s_t − y_t^d, and s_{t+1}. The ECNN has two different inputs: the externals u_t, which directly influence the state transition, and the targets y_t^d. Only the difference between y_t and y_t^d has an impact on s_{t+1} (Zimmermann et al., 2002b). At all future time steps t < τ ≤ t + n, we have no compensation of the internal expectations y_τ, and thus the system offers forecasts y_τ = C s_τ.

The autonomous part of the ECNN is, analogous to the RNN case (see section 8.2.3), extended into the future by overshooting. Besides all the advantages described in section 8.2.3, overshooting influences the learning of the ECNN in an extended way. A forecast provided by the ECNN is, in general, based on a modeling of the recursive structure of a dynamical system (coded in the matrix A) and on the error correction mechanism which acts as an external input (coded in C and D). Now, the overshooting enforces an autoregressive substructure allowing long-term forecasts. Of course, we have to supply target values for the additional output clusters y_τ, t < τ ≤ t + n. Due to the shared weights, there is no change in the number of model parameters (Zimmermann et al., 2002b).
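One unfolded pass of the ECNN of equation 8.6, including overshooting, can be sketched as follows. Dimensions, weights, and data are illustrative assumptions; in the future part no targets are available, so the correction term is simply dropped.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, p = 4, 2, 1                       # state, input, output dimensions
m, horizon = 6, 3                       # past unfolding, overshooting steps
A = 0.3 * rng.normal(size=(n, n))       # autonomous dynamics
B = rng.normal(size=(n, k))             # reads the external inputs
C = rng.normal(size=(p, n))             # output / error correction read-out
D = rng.normal(size=(n, p))             # feeds the correction back
c = np.zeros(n)

u = rng.normal(size=(m, k))             # past external inputs u_t
y_d = rng.normal(size=(m, p))           # past observations y_t^d

s = np.zeros(n)
for t in range(m):                      # past part: correction active
    z = np.tanh(C @ s - y_d[t])         # z_t, an output cluster with target 0
    s = np.tanh(A @ s + c + B @ u[t] + D @ z)

forecasts = []
for _ in range(horizon):                # future part: no targets, no correction
    s = np.tanh(A @ s + c)
    forecasts.append(C @ s)             # overshooting forecasts y_tau = C s_tau
forecasts = np.array(forecasts)
```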

8.3  Dynamical Consistent Neural Networks (DCNN)


The neural networks described in section 8.2 not only learn from data, but also integrate prior knowledge and first principles into the modeling in the form of architectural concepts. However, the question arises whether the outlined neural networks are a sufficient framework for the modeling of complex nonlinear dynamical systems, which can only be understood by analyzing the interrelationship of different subdynamics.

Consider the following economic example: the dynamics of the US dollar–euro foreign exchange market is clearly influenced by the development of other major foreign exchange, stock, or commodity markets (Murphy, 1999). In other words, movements of the US dollar–euro foreign exchange rate can only be comprehended by a combined analysis of the behavior of other coherent markets. This means that a model of the US dollar–euro foreign exchange market must also learn the dynamics of related markets and intermarket dependencies. Now it is important to note that, due to their limited computational power (in the sense of modeling high-dimensional nonlinear dynamics), the described medium-sized recurrent neural networks are only capable of modeling a single market's dynamics. From this point of view, an integrated approach to market modeling is hardly possible within the framework of those networks. Hence, we need large neural networks.

A simple scaling up of the presented neural networks would be misleading. Our experiments indicate that scaling up the networks by increasing the dimension of the internal state results in overfitting due to the large number of free parameters. Overfitting is a critical issue, because the neural network then learns not only the underlying dynamics, but also the noise included in the data. Especially in economic applications, overfitting poses a serious problem. In this section we deal with architectures which are feasible for large recurrent neural networks.
These architectures are based on a redesign of the recurrent neural networks introduced in section 8.2. Most of the resulting networks cannot even be designed with a low-dimensional internal state (see section 8.3.1). In addition, we focus on a consistency problem of traditional statistical modeling: typically one assumes that the environment of the system remains unchanged when the dynamics is iterated into the future. We show that this is a questionable statistical assumption, and solve the problem with a dynamical consistent recurrent neural network (see section 8.3.2). Thereafter, we deal with large error correction networks and integrate dynamical consistency into this framework (see section 8.3.3). We then point out that large RNNs and large ECNNs are appropriate for different types of dynamical systems. Our intention is to merge the different characteristics of the two models in a unified neural network architecture. We call it DCNN, for dynamical consistent neural network (see section 8.3.4). Finally, we discuss the problem of partially known observables (see section 8.3.5).

8.3.1  Normalization of Recurrent Networks

Let us revisit the basic time-delay recurrent neural network of 8.3. The state transition equation s_t is a nonlinear combination of the previous state s_{t−1} and the external influences u_t using matrices A and B. The network output y_t is computed from the present state s_t employing matrix C. The network output is therefore a nonlinear composition applying the transformations A, B, and C.

In preparation for the development of large networks we first separate the state equation of the recurrent network (eq. 8.3) into a past and a future part. In this framework s_t is always regarded as the present time state. That means that for this pattern t all states s_τ with τ ≤ t belong to the past part and those with τ > t to the future part. The parameter τ is hereby always bounded by the length of the unfolding m and the length of the overshooting n (see sections 8.2.2 and 8.2.3), such that we have τ ∈ {t − m, ..., t + n} for all t ∈ {m, ..., T − n}, with T as the available number of data patterns. The present time (τ = t) is included in the past part, as these state transitions share the same characteristics. We get the following representation of the optimization problem:

τ ≤ t:  s_{τ+1} = tanh(A s_τ + c + B u_τ)
τ > t:  s_{τ+1} = tanh(A s_τ + c)
        y_τ = C s_τ

∑_{t=m}^{T−n} ∑_{τ=t−m}^{t+n} (y_τ − y_τ^d)² → min_{A,B,C,c}    (8.8)

As shown in section 8.2, these equations can easily be transformed into a neural network architecture (see fig. 8.4). In this model, past and future iterations are consistent under the assumption of a constant future environment. The difficulty with this kind of recurrent neural network is the training with backpropagation through time, because a sequence of different connectors has to be balanced. The gradient computation is not regular, i.e., we do not have the same learning behavior for the weight matrices in different time steps. In our experiments we found that this problem becomes more important for training large recurrent neural networks. Even the training itself is unstable due to the concatenated matrices A, B, and C. As training changes weights in all of these matrices, different effects or tendencies, even opposing ones, can influence them and may superpose. This implies that no clear learning direction or weight changes result from a certain backpropagated error.

The question arises of how to redesign the basic recurrent architecture (eq. 8.8) to improve learning behavior and stability, especially for large networks. As a solution, we propose the neural network of 8.9, which incorporates, besides the bias c, only one connector, the matrix A. The corresponding architecture is depicted in fig. 8.6. Note that from now on we change the formulation of the system equations (e.g., eq. 8.8) from a forward (s_{t+1} = f(s_t, u_t)) to a backward formulation (s_t = f(s_{t−1}, u_t)). As we will see, the backward formulation is internally equivalent to a forward model.

Figure 8.6  Normalized recurrent neural network.

τ ≤ t:  s_τ = tanh(A s_{τ−1} + c + [0 0 Id]^T u_τ)
τ > t:  s_τ = tanh(A s_{τ−1} + c)
        y_τ = [Id 0 0] s_τ

∑_{t=m}^{T−n} ∑_{τ=t−m}^{t+n} (y_τ − y_τ^d)² → min_{A,c}    (8.9)
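The fixed connectors of the normalized network of 8.9 can be made explicit in a small sketch; the dimensions p (outputs), q (hidden neurons), r (inputs) and the random transition matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, r = 2, 5, 3                          # outputs, hidden neurons, inputs
dim_s = p + q + r                          # dim(s) >= p + q + r
A = 0.2 * rng.normal(size=(dim_s, dim_s))  # the single trainable connector
c = np.zeros(dim_s)

# fixed identity connectors: [Id 0 0] reads the output off the first p
# neurons, [0 0 Id]^T writes the input into the last r neurons
read_out = np.hstack([np.eye(p), np.zeros((p, q + r))])
read_in = np.vstack([np.zeros((p + q, r)), np.eye(r)])

s = np.zeros(dim_s)
for u in rng.normal(size=(10, r)):         # past part: tau <= t
    s = np.tanh(A @ s + c + read_in @ u)
y = read_out @ s                           # output at the first p components
s_future = np.tanh(A @ s + c)              # future part: constant environment
```

Because read_out @ read_in is the zero matrix, an input can reach the output only through at least one application of A, i.e., with a one-step delay.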

We call this model a normalized recurrent neural network (NRNN). It avoids the stability and learning problems resulting from the concatenation of the three matrices A, B, and C. The modeling is now focused solely on the transition matrix A. The matrices between input and hidden as well as between hidden and output layers are fixed and therefore not learned during the training process. This implies that all free parameters, as they are combined in one matrix, are now treated the same way by backpropagation. It is important to note that the normalization or concentration on only one single matrix is paid for with an oversized (high-dimensional) internal state.

At first view it seems that in this network architecture (fig. 8.6) the external input u_τ is directly connected to the corresponding output y_τ. This is not the case, though, because we increase the dimension of the internal state s_τ such that the input u_τ has no direct influence on the output y_τ. Assuming that we have a number p of network outputs, q computational hidden neurons, and r external inputs, the dimension of the internal state would be dim(s) ≥ p + q + r. With the matrix [Id 0 0] we connect only the first p neurons of the internal state s_τ to the output layer y_τ. This connector is a fixed identity matrix of appropriate size. Consequently, the neural network is forced to generate the p outputs of the neural network at the first p components of the state vector s_τ.

Let us now focus on the last r state neurons, which are used for the processing of the external inputs u_τ. The connector [0 0 Id]^T between the externals u_τ and the internal state s_τ is an appropriately sized fixed identity matrix. More precisely, the connector is designed such that the input u_τ is connected to the last state neurons. Recalling that the network outputs are located at the first p internal states, this composition avoids a direct connection between input and output. It delays the impact of the externals u_τ on the outputs y_τ by at least one time step. To additionally support the internal processing and to increase the network's computational power, we add a number q of hidden neurons between the first p and the last r state neurons. This composition ensures that the input and output processing of the network are separate.

Besides the bias vector c, the state transition matrix A holds the only tunable parameters of the system. Matrix A not only codes the autonomous and the externally driven parts of the dynamics, but also the processing of the external inputs u_τ and the computation of the network outputs y_τ. The bias added to the internal state handles offsets in the input variables u_τ. Remarkably, the normalized recurrent network of 8.9 can only be designed as a large neural network. If the internal network state is too small, the inputs and outputs cannot be separated, as the external inputs would at least partially cover the internal states at which the outputs are read out. Thus, the identification of the network outputs at the first p internal states would become impossible.

Our experiments indicate that recurrent neural networks in which the only tunable parameters are located in a single state transition matrix (e.g., eq. 8.9) show a more stable training process, even if the dimension of the internal state is very large. Having trained the large network to convergence, many weights of the state transition matrix will be dispensable without derogating the functioning of the network.
Unneeded weights can be singled out by using a weight decay penalty and standard pruning techniques (Haykin, 1994; Neuneier and Zimmermann, 1998).

In the normalized recurrent neural network (eq. 8.9) we consider inputs and outputs independently. This distinction between the externals u_τ and the network output y_τ is arbitrary and mainly depends on the application or the view of the model builder rather than on the real underlying dynamical system. Therefore, for the following model we take a different point of view. We merge inputs and targets into one group of variables, which we call observables. So we now look at the model as a high-dimensional dynamical system where input and output represent the observable variables of the environment. The hidden units stand for the unobservable part of the environment, which nevertheless can be reconstructed from the observations. This is an integrated view of the dynamical system. We implement this approach by replacing the externals u_τ with the (observable) targets y_τ^d in the normalized recurrent network. Consequently, the output y_τ and the external input y_τ^d now have identical dimensions:

τ ≤ t:  s_τ = tanh(A s_{τ−1} + c + [0 0 Id]^T y_τ^d)
τ > t:  s_τ = tanh(A s_{τ−1} + c)
        y_τ = [Id 0 0] s_τ

∑_{t=m}^{T−n} ∑_{τ=t−m}^{t+n} (y_τ − y_τ^d)² → min_{A,c}    (8.10)

The corresponding model architecture is shown in ﬁg. 8.7.
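To make the recursion concrete, here is a minimal NumPy sketch of one pass of eq. 8.10. The sizes r = 2 (observables) and q = 3 (hidden neurons), the random matrix A, and all helper names are illustrative assumptions, not part of the chapter.

```python
import numpy as np

# Minimal sketch of one recursion of the normalized recurrent network
# (eq. 8.10); r, q, the random A, and the helper names are illustrative.
r, q = 2, 3
n = r + q + r                                  # total state dimension

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(n, n))         # single tunable transition matrix
c = np.zeros(n)                                # tunable bias vector

connector = np.vstack([np.zeros((r + q, r)), np.eye(r)])   # [0 0 Id]^T

def step_past(s_prev, y_obs):
    """tau <= t: the observables y^d enter at the last r state neurons."""
    return np.tanh(A @ s_prev + c + connector @ y_obs)

def step_future(s_prev):
    """tau > t: purely autonomous iteration, no external input."""
    return np.tanh(A @ s_prev + c)

def output(s):
    """The network output is read off the first r state neurons."""
    return s[:r]

s = np.zeros(n)
for y_obs in [np.array([0.5, -0.3]), np.array([0.1, 0.2])]:
    s = step_past(s, y_obs)
y_forecast = output(step_future(s))
```

Note how the connector keeps inputs and outputs at opposite ends of the state vector, so the input affects the output only after at least one transition.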

00

s t−2

Id

s t−1 c

0 0 Id

ydt−2

Figure 8.7

A

00

c

Id

yt

A

Id

st c

0 0 Id

y t+2

y t+1

00

Id

y t−1

A

00

y t−2

s t+1 c

Id

A

00

216

s t+2 c

0 0 Id

ydt−1

ydt

Normalized recurrent net modeling the dynamics of observables yτd .

Note that, because of the one-step time delay between input and output, y_τ^d and y_τ are not directly connected. Furthermore, it is important to understand that we now take a completely different view of the dynamical system. In contrast to eq. 8.9, this network (eq. 8.10) generates forecasts not only for the dynamics of interest but for all external observables y_τ^d. Consequently, the first r state neurons are used for the identification of the network outputs. They are followed by q computational hidden neurons and r state neurons that read in the external inputs.

8.3.2 Dynamical Consistent Recurrent Neural Networks (DCRNN)

The models presented so far are all statistically but not dynamically consistent, as we assume that the environment stays constant in the future part of the network. In the following we extend our models with dynamical consistency.

An open dynamical system is driven partially by an autonomous development and partially by external influences. When the dynamics is iterated into the future, the development of the system environment is unknown. One of the standard statistical paradigms is to assume that the external influences do not change significantly in the future part. This means that the expected value of a shift in an external input y_τ^d with τ > t is zero by definition. For that reason we have so far neglected the external inputs y_τ^d in the normalized recurrent neural network at all future unfolding time steps τ > t (see eq. 8.10).

Especially when we consider fast-changing external variables with a high impact on the dynamics of interest, the above assumption is very questionable. In relation to eq. 8.10 it even poses a contradiction, as the observables are assumed to be constant on the input side and variable on the output side. Even in the case of a slowly changing environment, long-term forecasts become doubtful: the longer the forecast horizon, the more the statistical assumption is violated. A statistical model is therefore not consistent from a dynamical point of view. For a dynamically consistent approach, one has to integrate assumptions about the future development of the environment into the modeling of the dynamics. For that reason we propose a network that uses its own predictions as replacements for the unknown future observables. This is expressed by an additional fixed matrix in the state equation. The resulting DCRNN is:

$$
\begin{aligned}
\tau \le t:\quad & s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & 0 \end{bmatrix}
\tanh(A s_{\tau-1} + c) +
\begin{bmatrix} 0 \\ 0 \\ \mathrm{Id} \end{bmatrix} y_\tau^d \\
\tau > t:\quad & s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ \mathrm{Id} & 0 & 0 \end{bmatrix}
\tanh(A s_{\tau-1} + c) \\
& y_\tau = \big[\,\mathrm{Id}\;\,0\;\,0\,\big]\, s_\tau\\
& \sum_{t=m}^{T-n}\;\sum_{\tau=t-m}^{t+n} \big(y_\tau - y_\tau^d\big)^2 \;\to\; \min_{A,c}
\end{aligned}
\tag{8.11}
$$

Similarly to the end of section 8.3.1, we look at the state vector s_τ in a very structured way. The recursion of the state equations (eq. 8.11) acts in the past (τ ≤ t) and future (τ > t) always on the same partitioning of that vector. For all τ ∈ {t − m, ..., t + n}, s_τ can be described as

$$
s_\tau =
\begin{bmatrix}
y_\tau \\
h_\tau \\
\begin{cases} y_\tau^d & \tau \le t \\ y_\tau & \tau > t \end{cases}
\end{bmatrix}
=
\begin{bmatrix}
\text{expectations} \\
\text{hidden states} \\
\begin{cases} \text{observations} & \tau \le t \\ \text{expectations} & \tau > t \end{cases}
\end{bmatrix}
\tag{8.12}
$$

This means that in the first r components of the state vector we have the expectations y_τ, i.e., the predictions of the model. The q components in the middle of the vector represent the hidden units h_τ; they are responsible for the development of the dynamics. In the last r components of the vector we find, in the past (τ ≤ t), the observables y_τ^d, which the model receives as external input. In the future (τ > t) the model replaces these unknown future observables by its own expectations y_τ. This replacement is modeled with two consistency matrices:

$$
C_{\le} =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & 0 \end{bmatrix}
\quad\text{and}\quad
C_{>} =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ \mathrm{Id} & 0 & 0 \end{bmatrix}.
\tag{8.13}
$$
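To make the block structure concrete, here is a small numerical sketch of the two consistency matrices and one past and one future recursion of eq. 8.11. The sizes r = 2, q = 3, the random matrix A, and the helper names are illustrative assumptions.

```python
import numpy as np

# Numerical sketch of the DCRNN consistency matrices (eq. 8.13) and one
# past / future recursion of eq. 8.11; r, q, and the random A are
# illustrative assumptions.
r, q = 2, 3
n = r + q + r
Id = np.eye(r)

C_past = np.block([[np.eye(r + q), np.zeros((r + q, r))],
                   [np.zeros((r, r + q)), np.zeros((r, r))]])   # C_<=
C_future = np.block([[np.eye(r + q), np.zeros((r + q, r))],
                     [Id, np.zeros((r, q)), np.zeros((r, r))]]) # C_>

rng = np.random.default_rng(1)
A = rng.normal(scale=0.1, size=(n, n))
c = np.zeros(n)
inject = np.vstack([np.zeros((r + q, r)), Id])                  # [0 0 Id]^T

def step(s_prev, y_obs=None):
    z = np.tanh(A @ s_prev + c)
    if y_obs is not None:                 # tau <= t: read in observables
        return C_past @ z + inject @ y_obs
    return C_future @ z                   # tau > t: reuse own expectations

s_past = step(np.zeros(n), np.array([0.4, -0.2]))
s_future = step(s_past)
```

In the past step the last r components equal the injected observables; in the future step they coincide with the first r components, the model's own expectations.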

Let us explain one recursion of the state equation (eq. 8.11) in detail. In the past (τ ≤ t) we start with a state vector s_{τ−1}, which has the structure of eq. 8.12. This vector is first multiplied with the transition matrix A. After adding the bias c, the vector is sent through the nonlinearity tanh. The consistency matrix C_≤ then keeps the first r + q components (expectations and hidden states) of the state vector but deletes (by multiplication with zero) the last r components. These are finally replaced by the observables y_τ^d, such that s_τ again has the partitioning of eq. 8.12. Note that, in contrast to the normalized recurrent neural network (eq. 8.10), the observables are now added to the state vector after the nonlinearity. This is important for the consistency structure of the model.

The recursion in the future state transition (τ > t) differs from the one in the past in the structure of the consistency matrix and in the missing external input. The latter is now replaced with an additional identity block in the future consistency matrix C_>, which maps the first r components of the state vector, the expectations y_τ, to its last r components. Thus we get the desired partitioning of s_τ (eq. 8.12) and the model becomes dynamically consistent. Figure 8.8 illustrates this architecture. Note that the nonlinearity and the final calculation of the state vector are separate and hence modeled in two different layers. This follows from the dynamically consistent state equation (eq. 8.11), in which the observables are added separately from the nonlinear component.

Regarding the single transition matrix A, we want to point out that in a statistically consistent recurrent network (eq. 8.10) the matrix has to model both the state transformation over time and the merging of the input information. However, the network is only triggered by the external drivers up to the present time step t. In a dynamically consistent network we have forecasts of the external influences, which can be used as future inputs. Thus, the transition matrix A is always dedicated to the same task: modeling the dynamics.

Figure 8.8 Dynamical consistent recurrent neural network (DCRNN). At all future time steps of the unfolding the network uses its own forecasts as substitutes for the unknown development of the environment.

8.3.3 Dynamical Consistent Error Correction NNs (DCECNN)

The ECNN is a nonlinear state space model employing the shared weight matrices A, B, C, and D (eq. 8.6). Matrix A computes the state transformation over time, B processes the external input information, C derives the network output, and D is responsible for the error correction mechanism. This composition of nonlinear transformations A, B, C, and D is difficult to handle when the network's internal state is high dimensional. We therefore developed a dynamical consistent error correction neural network (DCECNN) of the form of eq. 8.14. It is an approach analogous to the DCRNN (eq. 8.11), and consequently the equations are very similar. The only changes concern the two consistency matrices C_≤ and C_>:

$$
\begin{aligned}
\tau \le t:\quad & s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ -\mathrm{Id} & 0 & 0 \end{bmatrix}
\tanh(A s_{\tau-1} + c) +
\begin{bmatrix} 0 \\ 0 \\ \mathrm{Id} \end{bmatrix} y_\tau^d \\
\tau > t:\quad & s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & 0 \end{bmatrix}
\tanh(A s_{\tau-1} + c) \\
& y_\tau = \big[\,\mathrm{Id}\;\,0\;\,0\,\big]\, s_\tau\\
& \sum_{t=m}^{T-n}\;\sum_{\tau=t-m}^{t+n} \big(y_\tau - y_\tau^d\big)^2 \;\to\; \min_{A,c}
\end{aligned}
\tag{8.14}
$$
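The error-correction mechanism of eq. 8.14 can be checked numerically: in the past, the −Id block of C_≤ turns the last r state components into e_τ = y_τ^d − y_τ. The sizes and the random matrix A below are illustrative assumptions.

```python
import numpy as np

# Numerical check of the DCECNN consistency structure (eq. 8.14): in the
# past the -Id block turns the last r state components into the error
# correction e = y^d - y. Sizes and the random A are illustrative.
r, q = 2, 3
n = r + q + r
Id = np.eye(r)

C_past = np.zeros((n, n))
C_past[:r + q, :r + q] = np.eye(r + q)
C_past[r + q:, :r] = -Id                       # -Id block of C_<=

rng = np.random.default_rng(2)
A = rng.normal(scale=0.1, size=(n, n))
c = np.zeros(n)
inject = np.vstack([np.zeros((r + q, r)), Id])

def step_past(s_prev, y_obs):
    return C_past @ np.tanh(A @ s_prev + c) + inject @ y_obs

y_obs = np.array([0.3, -0.1])
s = step_past(rng.normal(size=n), y_obs)
expectation = s[:r]        # y_tau, first r components
error_corr = s[r + q:]     # e_tau, last r components
```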

Due to the error correction, the definition or partitioning of the state vector differs in the last r components. We now have, for all τ ∈ {t − m, ..., t + n},

$$
s_\tau =
\begin{bmatrix}
y_\tau \\
h_\tau \\
\begin{cases} e_\tau & \tau \le t \\ 0 & \tau > t \end{cases}
\end{bmatrix}
=
\begin{bmatrix}
\text{expectations} \\
\text{hidden states} \\
\begin{cases} \text{error correction} & \tau \le t \\ 0 & \tau > t \end{cases}
\end{bmatrix}
\tag{8.15}
$$

In the past part (τ ≤ t) we get the error correction term in the state vector by subtracting the expectations y_τ from the observations y_τ^d. This is performed by the negative identity matrix −Id within the consistency matrix C_≤. In the future part (τ > t) we expect our model to be correct and therefore replace the error correction by zero: the future consistency matrix C_> simply overwrites the last r components of the state vector with zero. Analogous to the DCRNN, the internal transition matrix A is only used for the modeling of the dynamics over time. The graphical illustration of a dynamical consistent error correction neural network is identical to the recurrent one (fig. 8.8), but note that the consistency matrices C_≤ and C_> have changed their structure.

8.3.4 Dynamical Consistent Neural Networks (DCNN)

Dynamical consistent recurrent neural networks (see section 8.3.2) are most appropriate if the observed dynamics is not hidden by noise and evolves smoothly over time, e.g., the modeling of a sine curve. However, modeling can only be successful if we know all external drivers of the system and the dynamics is not influenced by external shocks. In many real-world applications, e.g., trading (see Zimmermann et al. (2002a)), this is simply not true. The dynamics of interest is often covered with noise, and external shocks or unknown external influences disturb the system dynamics. In this case one should apply DCECNNs (see section 8.3.3), which describe the dynamics with an internal expectation and its deviation from the observables.

Now the question arises of whether and how we can merge the different model characteristics within a single dynamical consistent neural network (DCNN). There are two different ways to set up this combination. Our first approach (eq. 8.18) keeps the framework of the DCRNN (eq. 8.11), whereas the second one (eq. 8.22) is based on the DCECNN (eq. 8.14).

The first approach is based on the DCRNN. Consequently, the state vector s_τ has, in the past (τ ≤ t) and future (τ > t), for all τ ∈ {t − m, ..., t + n}, the partitioning of eq. 8.16 (see also eq. 8.12):

$$
s_\tau =
\begin{bmatrix}
y_\tau \\
h_\tau \\
\begin{cases} y_\tau^d & \tau \le t \\ y_\tau & \tau > t \end{cases}
\end{bmatrix}
=
\begin{bmatrix}
\text{expectations} \\
\text{hidden states} \\
\begin{cases} \text{observations} & \tau \le t \\ \text{expectations} & \tau > t \end{cases}
\end{bmatrix}
\tag{8.16}
$$

In comparison to the DCRNN (eq. 8.11), the recursion of the new model (eq. 8.18) is extended by an additional consistency matrix

$$
C =
\begin{bmatrix} 0 & 0 & \mathrm{Id} \\ 0 & \mathrm{Id} & 0 \\ -\mathrm{Id} & 0 & \mathrm{Id} \end{bmatrix}
\tag{8.17}
$$

between the state vector and the transition matrix A. As we will see, this matrix ensures that the model is supplied with the information of the observables y_τ^d as well as the error corrections e_τ. We call this approach DCNN1 (eq. 8.18). The


corresponding network architecture is depicted in fig. 8.9.

$$
\begin{aligned}
\tau \le t:\quad & s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & 0 \end{bmatrix}
\tanh\!\left(A
\begin{bmatrix} 0 & 0 & \mathrm{Id} \\ 0 & \mathrm{Id} & 0 \\ -\mathrm{Id} & 0 & \mathrm{Id} \end{bmatrix}
s_{\tau-1} + c\right) +
\begin{bmatrix} 0 \\ 0 \\ \mathrm{Id} \end{bmatrix} y_\tau^d \\
\tau > t:\quad & s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ \mathrm{Id} & 0 & 0 \end{bmatrix}
\tanh\!\left(A
\begin{bmatrix} 0 & 0 & \mathrm{Id} \\ 0 & \mathrm{Id} & 0 \\ -\mathrm{Id} & 0 & \mathrm{Id} \end{bmatrix}
s_{\tau-1} + c\right) \\
& y_\tau = \big[\,\mathrm{Id}\;\,0\;\,0\,\big]\, s_\tau\\
& \sum_{t=m}^{T-n}\;\sum_{\tau=t-m}^{t+n} \big(y_\tau - y_\tau^d\big)^2 \;\to\; \min_{A,c}
\end{aligned}
\tag{8.18}
$$

To describe how the model evolves, we explain the state equations step by step. We start with a state vector s_{τ−1} which has the structure of eq. 8.16. Through the multiplication with the additional consistency matrix C, the state vector is transformed into a vector with the partitioning

$$
\tilde{s}_\tau =
\begin{bmatrix} y_\tau^d \\ h_\tau \\ e_\tau \end{bmatrix}
=
\begin{bmatrix} \text{observations} \\ \text{hidden states} \\ \text{error correction} \end{bmatrix}
\tag{8.19}
$$

for all τ ∈ {t − m, ..., t + n}. This inner state vector s̃_τ contains the observables and the error correction, and so combines the ideas of the DCRNN and the DCECNN. The rest of the recursion is identical with the DCRNN (eq. 8.11). As before, the only learnable parameters of the network are located in the matrix A and the bias c.
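The effect of the inner consistency matrix C of eq. 8.17 can be verified on a toy vector: applied to a state with the partitioning of eq. 8.16, it produces the inner state vector of eq. 8.19. The sizes below are illustrative assumptions.

```python
import numpy as np

# Check that the additional consistency matrix C of eq. 8.17 maps a state
# vector with the partitioning of eq. 8.16 onto the inner state vector of
# eq. 8.19 (observations, hidden states, error correction). Sizes are
# illustrative.
r, q = 2, 3
Id = np.eye(r)

C = np.block([[np.zeros((r, r)), np.zeros((r, q)), Id],
              [np.zeros((q, r)), np.eye(q), np.zeros((q, r))],
              [-Id, np.zeros((r, q)), Id]])

y = np.array([0.2, 0.5])            # expectations
h = np.array([0.1, -0.4, 0.3])      # hidden states
y_obs = np.array([0.25, 0.45])      # observations (past part)

s_inner = C @ np.concatenate([y, h, y_obs])
```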

Figure 8.9 Dynamical consistent neural network (DCNN).

As already mentioned, the second approach to a dynamical consistent neural network (DCNN2) is based on the DCECNN model (eq. 8.14). The state vector s_τ


assumes the corresponding structure (see eq. 8.15):

$$
s_\tau =
\begin{bmatrix}
y_\tau \\
h_\tau \\
\begin{cases} e_\tau & \tau \le t \\ 0 & \tau > t \end{cases}
\end{bmatrix}
=
\begin{bmatrix}
\text{expectations} \\
\text{hidden states} \\
\begin{cases} \text{error correction} & \tau \le t \\ 0 & \tau > t \end{cases}
\end{bmatrix}
\tag{8.20}
$$

Analogous to the development of DCNN1 (eq. 8.18), the DCECNN equation is extended by an additional consistency matrix C, which now has the structure

$$
C =
\begin{bmatrix} \mathrm{Id} & 0 & \mathrm{Id} \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & \mathrm{Id} \end{bmatrix}.
\tag{8.21}
$$

The resulting DCNN2 can be described with the following set of equations:

$$
\begin{aligned}
\tau \le t:\quad & s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ -\mathrm{Id} & 0 & 0 \end{bmatrix}
\tanh\!\left(A
\begin{bmatrix} \mathrm{Id} & 0 & \mathrm{Id} \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & \mathrm{Id} \end{bmatrix}
s_{\tau-1} + c\right) +
\begin{bmatrix} 0 \\ 0 \\ \mathrm{Id} \end{bmatrix} y_\tau^d \\
\tau > t:\quad & s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & 0 \end{bmatrix}
\tanh\!\left(A
\begin{bmatrix} \mathrm{Id} & 0 & \mathrm{Id} \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & \mathrm{Id} \end{bmatrix}
s_{\tau-1} + c\right) \\
& y_\tau = \big[\,\mathrm{Id}\;\,0\;\,0\,\big]\, s_\tau\\
& \sum_{t=m}^{T-n}\;\sum_{\tau=t-m}^{t+n} \big(y_\tau - y_\tau^d\big)^2 \;\to\; \min_{A,c}
\end{aligned}
\tag{8.22}
$$
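That DCNN2 feeds the transition matrix the same inner state vector as DCNN1 can be checked on a toy vector: the matrix C of eq. 8.21, applied to a state partitioned as in eq. 8.20, again yields observations, hidden states, and error correction. Sizes below are illustrative assumptions.

```python
import numpy as np

# Check that the DCNN2 consistency matrix C of eq. 8.21, applied to a state
# vector partitioned as in eq. 8.20, yields the same inner state vector
# (eq. 8.19) as DCNN1: observations, hidden states, error correction.
# Sizes are illustrative.
r, q = 2, 3
Id = np.eye(r)

C2 = np.block([[Id, np.zeros((r, q)), Id],
               [np.zeros((q, r)), np.eye(q), np.zeros((q, r))],
               [np.zeros((r, r)), np.zeros((r, q)), Id]])

y = np.array([0.2, 0.5])        # expectations
h = np.array([0.1, -0.4, 0.3])  # hidden states
e = np.array([0.05, -0.05])     # error correction, e = y^d - y

s_inner = C2 @ np.concatenate([y, h, e])
```

The first r components come out as y + e, i.e., the observations y^d, exactly the inner partitioning of eq. 8.19.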


Looking at the multiplication C · s_{τ−1}, we can easily confirm that, supposing s_{τ−1} is structured as in eq. 8.20, we once again obtain an inner state vector s̃_τ partitioned as in eq. 8.19. This implies that the transition matrix A is applied in both models to the same inner state vector s̃_τ. Consequently, although the two models look quite different, they share an identical modeling of the dynamics. Which approach is preferable may depend on additional modeling tools or on the particular application. The network architecture of the alternative approach, DCNN2, is identical to that of DCNN1 (fig. 8.9), but note that the consistency matrices C, C_≤, and C_> differ.

In contrast to the DCRNN (eq. 8.11) and the DCECNN (eq. 8.14), the two approaches to the DCNN (eqs. 8.18 and 8.22) compute the state trajectory of the dynamics in the past exactly on the observed path. This follows from the partitioning of the inner state vector s̃_τ (eq. 8.19), which is responsible for the calculation of the dynamics: it contains the observables in the first r components, which are directly used to determine the prediction y_τ. The error corrections, now located in the last r components, act as additional inputs. Furthermore, the DCNN offers an


interesting new insight into the observation of dynamical systems. Typically, small movements of the dynamics are treated as noise, and the modeling thus focuses on larger shifts in the dynamics. Our view is different: we believe that small system changes characterize the autonomous part of our open system, while the large swings originate at least partially from the external forces. If we neglect small system changes, we also suppress valuable substructure in our observations. We found that DCNNs allow us to model even small changes in the dynamics without losing the generalization abilities of the model. This introduces a new perspective on the structure/noise dilemma in modeling dynamical systems.

8.3.5 Partially Known Observables

So far our models have always distinguished between a past and a future development of the state equation. We assumed that in the past part (τ ≤ t) all the identified observables are available, while in the future part (τ > t) we accepted that we know nothing about the observables and hence replaced them by the model's own expectations. In many practical applications, however, we have observables which are not available for all time steps in the past. Conversely, one might have observables which are also available in the future, e.g., calendar data. In the following we therefore switch from a model differentiating between past and future to a modeling structure which distinguishes between available and missing external inputs.

The DCNN with partially known observables merges the two state equations of the DCNN (e.g., eq. 8.22) into one single equation that allows us to differentiate between available and unavailable observables. It is thus a reformulation of the normal DCNN providing an easier and more general structure. The simplification into one equation also makes the model more tractable for further discussions (see sections 8.4 and 8.5). The following model (eq. 8.23) is based on DCNN2 (eq. 8.22), but an analogous model can also easily be created for DCNN1 (eq. 8.18). For all τ ∈ {t − m, ..., t + n}, we have

$$
\begin{aligned}
& s_\tau =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ E & 0 & 0 \end{bmatrix}
\tanh\!\left(A
\begin{bmatrix} \mathrm{Id} & 0 & \mathrm{Id} \\ 0 & \mathrm{Id} & 0 \\ 0 & 0 & \mathrm{Id} \end{bmatrix}
s_{\tau-1} + c\right) +
\begin{bmatrix} 0 \\ 0 \\ \mathrm{Id} \end{bmatrix} y_\tau^E \\
& y_\tau = \big[\,\mathrm{Id}\;\,0\;\,0\,\big]\, s_\tau \,,\qquad
\sum_{t=m}^{T-n}\;\sum_{\tau=t-m}^{t+n} \big(y_\tau - y_\tau^d\big)^2 \;\to\; \min_{A,c}
\end{aligned}
\tag{8.23}
$$

In this model the external inputs y_τ^E and the included matrix E are defined as follows:

$$
y_\tau^E :=
\begin{cases} 0 & \text{if input missing} \\ y_\tau^d & \text{if input available} \end{cases}
\tag{8.24}
$$


and

$$
E :=
\begin{cases} 0 & \text{if input missing} \\ -\mathrm{Id} & \text{if input available.} \end{cases}
\tag{8.25}
$$

It is important to note that the inner consistency matrix C is independent of the input availability. We only adapt the consistency matrix

$$
C^E =
\begin{bmatrix} \mathrm{Id} & 0 & 0 \\ 0 & \mathrm{Id} & 0 \\ E & 0 & 0 \end{bmatrix}.
\tag{8.26}
$$

The structure guarantees that an error correction is calculated in the last r state components if external input is available. Thus, we have a time-independent combination of the former two state equations (eq. 8.22). The corresponding model architecture (ﬁg. 8.10) does not change signiﬁcantly in comparison to the former (time-oriented) DCNN (ﬁg. 8.9).
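One state transition of this time-independent formulation can be sketched as follows. The block E switches per time step between 0 (input missing) and −Id (input available); the sizes, the random matrix A, and the helper names are illustrative assumptions.

```python
import numpy as np

# Sketch of one state transition of the DCNN with partially known
# observables (eq. 8.23); sizes, the random A, and helper names are
# illustrative assumptions.
r, q = 2, 3
n = r + q + r
Id = np.eye(r)

C_inner = np.block([[Id, np.zeros((r, q)), Id],
                    [np.zeros((q, r)), np.eye(q), np.zeros((q, r))],
                    [np.zeros((r, r)), np.zeros((r, q)), Id]])   # eq. 8.21

rng = np.random.default_rng(3)
A = rng.normal(scale=0.1, size=(n, n))
c = np.zeros(n)
inject = np.vstack([np.zeros((r + q, r)), Id])

def step(s_prev, y_obs):
    available = y_obs is not None
    E = -Id if available else np.zeros((r, r))           # eq. 8.25
    C_E = np.zeros((n, n))
    C_E[:r + q, :r + q] = np.eye(r + q)
    C_E[r + q:, :r] = E                                  # eq. 8.26
    y_E = y_obs if available else np.zeros(r)            # eq. 8.24
    return C_E @ np.tanh(A @ C_inner @ s_prev + c) + inject @ y_E

s = step(np.zeros(n), np.array([0.3, -0.2]))   # observable available
s = step(s, None)                              # observable missing
```

When the observable is missing, the last r components of the state are overwritten with zero, exactly as in the future part of eq. 8.22.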

Figure 8.10 DCNN with partially known observables.

The DCNN with partially known observables is more general with respect to the availability of observables and hence more readily applicable to real-world problems. The following discussions are mainly based on this model.

8.4 Handling Uncertainty

In practical applications our models have to cope with several forms of uncertainty. So far we have neglected their possible influence on the generalization performance. Uncertainty can disturb the development of the internal dynamics and seriously harm the quality of our forecasts. In this section we present several methods which reduce the model's dependency on uncertain data.

There are three major sources of uncertainty. First, the input data itself might be corrupted or noisy; we deal with that problem in section 8.4.1. In the framework of recurrent neural networks finitely unfolded in time we also have


the uncertainty of the initial state. We present different approaches to overcome that uncertainty and to desensitize the model to the unknown initialization (see section 8.4.2). Finally, we discuss the uncertainty of the future inputs and question once more the assumption of a constant environment (see section 8.4.3).

8.4.1 Handling Data Noise

So far we have always assumed our input data to be correct. In most practical applications this is not true. In the following we present an approach which tries to minimize input uncertainty. Cleaning noise is a method which improves the model's learning behavior by correcting corrupted or noisy input data. The method is an enhancement of the cleaning technique described in detail in Neuneier and Zimmermann (1998). In short, cleaning considers the inputs as corrupted and adds corrections to the inputs where necessary. However, we want to keep the cleaning correction as small as possible. This leads to an extended error function

$$
E_t^{y,x} = \frac{1}{2}\left[(y_t - y_t^d)^2 + (x_t - x_t^d)^2\right] = E_t^y + E_t^x \;\to\; \min_{x_t, w}.
\tag{8.27}
$$

Note that this new error function does not change the usual weight adaptation rule

$$
w^{+} = w - \eta\, \frac{\partial E^y}{\partial w},
\tag{8.28}
$$

where η > 0 is the so-called learning rate and w⁺ stands for the adapted weight. To calculate the cleaned input

$$
x_t = x_t^d + \rho_t
\tag{8.29}
$$

we need the correction vectors ρ_t for all input data of the training set. The update rule for these corrections, initialized with ρ_t = 0, can be derived from the typical adaptation sequence

$$
x_t^{+} = x_t - \eta\, \frac{\partial E^{y,x}}{\partial x},
\tag{8.30}
$$

leading to

$$
\rho_t^{+} = (1 - \eta)\rho_t - \eta\, \frac{\partial E^y}{\partial x}.
\tag{8.31}
$$
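A toy sketch of these update rules on a simple linear model, under illustrative assumptions (the model, the data, and the learning rate eta are ours, not taken from the chapter):

```python
import numpy as np

# Toy sketch of the cleaning updates (eqs. 8.28-8.31) on a linear model.
rng = np.random.default_rng(4)
eta = 0.05
w = rng.normal(size=3)
x_d = rng.normal(size=(5, 3))                 # observed (corrupted) inputs
y_d = x_d @ np.array([1.0, -2.0, 0.5])        # targets of a known dynamics
rho = np.zeros_like(x_d)                      # corrections rho_t, init 0

def loss(w, rho):
    return 0.5 * np.sum(((x_d + rho) @ w - y_d) ** 2)

loss_start = loss(w, rho)
for _ in range(300):
    x = x_d + rho                             # cleaned input (eq. 8.29)
    err = x @ w - y_d                         # output residual
    w = w - eta * (x.T @ err) / len(x)        # weight adaptation (eq. 8.28)
    rho = (1 - eta) * rho - eta * err[:, None] * w[None, :]   # eq. 8.31
loss_end = loss(w, rho)
```

The decay factor (1 − η) in the correction update keeps the cleaning term small, implementing the E^x penalty of eq. 8.27.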

This is a nonlinear version of the error-in-variables concept from statistics. We derive all the information needed, especially the residual error ∂E^{y,x}/∂x, from training the network with backpropagation (fig. 8.18), which makes the additional computational effort negligible. It is important to note that in this way the corrections are performed by the model itself and not by applying external knowledge (see the "observer-observation dilemma" in Neuneier and Zimmermann, 1998). We now assume that the data is not only corrupted but also noisy. For that


reason we add an extra noise vector, −ρ_τ, to the cleaned value:

$$
x_t = x_t^d + \rho_t - \rho_\tau.
\tag{8.32}
$$

The noise vector ρ_τ is a randomly chosen row vector {ρ_τ^i}_{i=1,...,r} of the cleaning matrix

$$
C_{\text{Cleaning}} :=
\begin{bmatrix}
\rho_1^1 & \cdots & \rho_1^r \\
\vdots & \rho_t^i & \vdots \\
\rho_T^1 & \cdots & \rho_T^r
\end{bmatrix},
$$

which stores the input error corrections of all data patterns. The matrix has the same size as the pattern matrix: the number of rows equals the number of patterns T and the number of columns equals the number of inputs r. One might wonder why we disturb the cleaned input x_t = x_t^d + ρ_t with an additional noise term −ρ_τ. The reason is that we want to benefit from presenting the whole input distribution to the network instead of using only one particular realization (Zimmermann and Neuneier, 1998).

A variation on the cleaning noise method is called local cleaning noise. Cleaning noise adds the same noise term −ρ_τ to every training pattern and therefore assumes that the noise of the different inputs is correlated. Especially in high-dimensional models it is improbable that all the components of the input vector follow an identical or at least correlated noise distribution. For these cases we propose a method which is able to differentiate component-wise:

$$
x_t^i = x_t^{d,i} + \rho_t^i - \rho_{\tau_i}^i.
\tag{8.33}
$$

In contrast to the normal cleaning technique, the local version corrects each component x_t^i of the input vector individually, by a cleaning correction and a randomly taken entry ρ_{τ_i}^i of the corresponding column {ρ_t^i}_{t=1,...,T} of the cleaning matrix C_Cleaning. A further advantage of the local cleaning technique is that, with the increased number of (local) correction terms (T · r), we can cover higher dimensions. With the normal cleaning technique, in contrast, the dimension is bounded by the number of training patterns T, which can be insufficient for high-dimensional problems.

8.4.2 Handling the Uncertainty of the Initial State

One of the diﬃculties with ﬁnite unfolding in time is to ﬁnd a proper initialization for the ﬁrst state vector of the recurrent neural network. An obvious solution is to set the ﬁrst state s0 to zero. We then implicitly assume that the unfolding includes

enough (past) time steps such that the misspecification of the initialization phase is compensated along the state transitions. In other words, the network accumulates information over time and can thus eliminate the impact of the arbitrary initial state on the network outputs.

The model can be improved if we make the unfolded recurrent network less sensitive to the unknown initial state s_0. For this purpose we look for an initialization for which the interpretation of the state recursion is consistent over time. Since the initialization procedure is identical for all types of DCNNs, we demonstrate the approach on the DCNN with partially known observables (eq. 8.23):

$$
\begin{aligned}
& s_\tau = C^E \tanh(A \cdot C \cdot s_{\tau-1} + c) +
\begin{bmatrix} 0 \\ 0 \\ \mathrm{Id} \end{bmatrix} y_\tau^E \\
& y_\tau = \big[\,\mathrm{Id}\;\,0\;\,0\,\big]\, s_\tau \,,\qquad
\sum_{t=m}^{T-n}\;\sum_{\tau=t-m}^{t+n} \big(y_\tau - y_\tau^d\big)^2 \;\to\; \min_{A,c}
\end{aligned}
\tag{8.34}
$$

In a first step we explicitly integrate an initial state vector s_0 (see fig. 8.11). The first state is no longer set to zero but receives the target information y^{tar}_{t−1} of the first output. This vector is multiplied by the consistency matrix C^E, such that the first r components of the first state vector s_{t−1} coincide with the first expected output. This avoids the generation of an excessively large error for the first output.

Figure 8.11 Time-consistent initialization of a DCNN with an additional initial state s_0.

The hidden states of this model are arbitrarily initialized with zero. In a second step we add a noise term ε to the first state vector s_0 to stiffen the model against the uncertainty of the unknown initial state. A fixed noise term ε drawn from a predetermined noise distribution is clearly inadequate to handle the uncertainty of the initial state. Instead we apply, according to the cleaning noise method, an adaptive noise term which best fits the volatility of the unknown initial state s_0. As explained in section 8.4.1, the characteristics of the adaptive noise term are automatically determined as a by-product of the error backpropagation algorithm.


The basic idea is as follows: The residual error ρ as measured at the initial state s0 can be interpreted as the uncertainty stemming from missing information about the true initial state vector. If we disturb s0 with a noise term which follows the distribution of the residual error of the network, we diminish the uncertainty about the unknown initial state during system identiﬁcation. In addition, this allows a better ﬁtting of the target values over the training set. A corresponding network architecture is depicted in ﬁg. 8.12.
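The mechanism can be sketched as follows: stored residuals at s_0 are re-drawn as noise for the hidden components, while the first r components carry the target information. The sizes and the fabricated residuals are illustrative assumptions, not data from the chapter.

```python
import numpy as np

# Sketch of the adaptive start-noise idea: residual errors observed at the
# initial state are stored and re-drawn as noise for the hidden components
# of s_0. Sizes and residuals are fabricated for the demo.
rng = np.random.default_rng(5)
r, q = 2, 3
n = r + q + r

# Residuals at s_0, one row per training pattern (fabricated).
residuals = rng.normal(scale=0.2, size=(100, n))

def initial_state(y_target):
    s0 = np.zeros(n)
    s0[:r] = y_target                        # target information y^tar_{t-1}
    rho = residuals[rng.integers(len(residuals))]
    s0[r:r + q] += rho[r:r + q]              # noise only on hidden components
    return s0

s0 = initial_state(np.array([0.4, -0.1]))
```

The incomplete identity matrix of fig. 8.12 corresponds here to perturbing only the q hidden components, where no input information is available.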

Figure 8.12 Desensitization of a DCNN from the unknown initial state s_0.

Technically, noise is introduced into the model via an additional input layer. The dimension of the noise is equal to that of the internal state. The input values are fixed at zero over time. Due to the incomplete identity matrix between the noise and the initial state, the noise is only applied to the hidden values of the initial state, where no input information is available. The desensitization of the network to the initial state vector s_0 can therefore be seen as a self-scaling stabilizer of the modeling. Note that the noise term ρ is drawn randomly from the observed residual errors, without any prior assumption on the underlying noise distribution.

In general, a discrete-time state trajectory forms a sequence of points over time. Such a trajectory is comparable to a thread in the internal state space and is very sensitive to the initial state vector s_0. If we apply noise to s_0, the space of all possible trajectories becomes a tube in the internal state space (fig. 8.13). Due to the characteristics of the adaptive noise term, which decreases over time, the tube contracts. This enforces the identification of a stable dynamical system. Consequently, the finite-volume trajectories act as a regularization and stabilization of the dynamics.

The question arises as to which method creates the most appropriate noise level. Table 8.1 gives an overview of several initialization techniques we have developed and examined so far. Remember that in all cases the corrections are only applied to the hidden variables of the initial state s_0. The first three methods were already explained in section 8.4.1. The idea behind the initialization with start noise is that we do not need a cleaning correction but solely focus on the noise term. Double start noise tries to achieve a nearly symmetrical noise distribution, which is also doubled in comparison to normal start


Figure 8.13 Creating a tube in the internal state space by applying noise to the initial state.

Table 8.1 Overview of initialization techniques

    Cleaning:                  s_0   = 0 + ρ_t
    Cleaning noise:            s_0   = 0 + ρ_t − ρ_τ
    Local cleaning noise:      s_0^i = 0 + ρ_t^i − ρ_{τ_i}^i
    Start noise:               s_0   = 0 + ρ_τ
    Local start noise:         s_0^i = 0 + ρ_{τ_i}^i
    Double start noise:        s_0   = 0 + (ρ_τ^1 − ρ_τ^2)
    Double local start noise:  s_0^i = 0 + (ρ_{τ_i}^1 − ρ_{τ_i}^2)

noise. In all cases, local corresponds to the individual application of a noise term to each component of the initial state s_0 (see local cleaning noise in section 8.4.1). From top to bottom, the methods listed in table 8.1 use less and less information about the training set; hence double start noise emphasizes the generalization abilities of the model most strongly. This is also confirmed by our experiments. Furthermore, we could confirm that the local initialization techniques lead to better performance in high-dimensional models (see section 8.4.1).

8.4.3 Handling the Uncertainty of Unknown Future Inputs

In the past part of the network, the influence of the unknown externals is reflected in the error corrections as calculated by the backpropagation algorithm. In the future part we do not have any information about the correctness of our inputs. As explained in section 8.3, we either use our own forecasts as future inputs or simply assume that the inputs stay constant in the future. The underlying assumption is that the observables evolve in the future as they did in the past. We cannot verify whether this is correct, and for most practical applications it is a very questionable assumption.


To stabilize our model against these uncertainties of the future inputs we apply a Gaussian noise term εt+τ to the last r components of each future state vector st+τ . The corresponding architecture is depicted in ﬁg. 8.14.

Figure 8.14 Handling the uncertainty of the future inputs by adding a noise term ε_{t+τ} to each future state vector s_{t+τ}.

The additional noise is used during the training of the model to achieve a more stable output. For the actual deterministic forecast we either skip the application of noise to avoid a disturbance of the predictions or average our results over a suﬃcient number of diﬀerent forecasts (Monte Carlo approach).
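The two prediction modes described above can be sketched as follows; the model, the sizes, the noise scale, and the helper names are illustrative assumptions.

```python
import numpy as np

# Sketch of handling unknown future inputs with noise (fig. 8.14): during
# the future iteration, Gaussian noise eps is added to the last r state
# components; forecasts can then be averaged in a Monte Carlo fashion.
rng = np.random.default_rng(6)
r, q = 2, 3
n = r + q + r
A = rng.normal(scale=0.1, size=(n, n))
c = np.zeros(n)

def forecast(s_t, horizon, noise_scale=0.0):
    s = s_t.copy()
    outputs = []
    for _ in range(horizon):
        s = np.tanh(A @ s + c)
        s[r + q:] += rng.normal(scale=noise_scale, size=r)  # eps_{t+tau}
        outputs.append(s[:r].copy())
    return np.array(outputs)

s_t = rng.normal(size=n)
deterministic = forecast(s_t, 3)                      # noise skipped
monte_carlo = np.mean([forecast(s_t, 3, noise_scale=0.1)
                       for _ in range(200)], axis=0)  # averaged forecasts
```

Setting the noise scale to zero reproduces the deterministic forecast; averaging many noisy runs gives the Monte Carlo variant.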

8.5 Function and Structure in Recurrent Neural Networks


Our discussion of function and structure in recurrent neural networks focuses on the autonomous part of the model, which is mapped by the internal state transition matrix A. So far the transition matrix A has always been assumed to be fully connected. In a fully connected matrix, the information of a state vector s_t is processed using all the weights in A to compute s_{t+1}. This implies a high proportion of superposition (computation) but hardly any conservation of information (memory) from one state to a succeeding one (see the right panel of fig. 8.15). For the identification of dynamical systems such memory can be essential, as information may be needed for computation in subsequent time steps. A shift register (see the left panel of fig. 8.15) is a simple example of the implementation of memory, as it only transports information within the state vector s; no superposition is performed in this transition matrix.

At first sight we have two conflicting functions: superposition and conservation of information. Superposition of information is necessary to generate or adapt changes of the dynamics. In contrast, conservation of information causes memory effects by transporting information more or less unmodified to a subsequent state neuron. In this context, memory can be defined as the average number of state transitions necessary to transmit information from one state neuron to any other


Figure 8.15 Function and structure in dynamical systems: computation versus memory in the transition matrix A (left: a shift register, memory = conservation of information; right: a fully connected matrix, computation = superposition of information).

one in a subsequent state. We call this number of necessary state transitions the path length of a neuron. To overcome the apparent dilemma between superposition and conservation of information, the transition matrix A needs a structure which balances memory and computation effects. Sparseness of the transition matrix reduces the number of paths and the computation effect of the network, but at the same time increases the average path length and therefore allows for longer-lasting memory. A possible solution is an inflation of the recurrent network, i.e., of the transition matrix A. We show that with such an inflation an optimal balance between memory and computation can be achieved (section 8.5.1). In this context we present conjectures about the optimal level of sparseness and the required minimum dimension. An experiment with artificial data supports our results (section 8.5.2). In section 8.5.3 we conclude that sparseness is actually an essential condition for high-dimensional neural networks. Finally, we discuss in section 8.5.4 the information flow in sparse networks.

8.5.1 Inflation of Recurrent Neural Networks

Modeling Large Dynamical Systems with Dynamical Consistent Neural Networks

Based on the length of the past unfolding m (see section 8.2.2) and the optimal state dimension dim(s) of a fully connected recurrent network, we can define a procedure for an optimal design of the neural network structure, which solves the dilemma between memory and computation. The idea is to inflate the network to a higher dimensionality, while maintaining the computational complexity of the former lower-dimensional and fully connected network, and at the same time allowing for memory effects. With an inflated transition matrix A we can optimize both superposition and conservation of information. To determine the optimal dimension and the level of sparseness, we propose two conjectures, which we empirically investigate in section 8.5.2. In a first step we calculate the new dimension of the internal state s by

dim(s_new) := m · dim(s) .   (8.35)

As the former dimension of s was supposed to be optimal, we have to ensure that the higher-dimensional network has the same superposition of information as the original one. This can be achieved by keeping the number of active weights constant. On average we want to have the same number of nonzero elements as in the former lower-dimensional network. Thus, the sparseness level of the new matrix A_new is given by

initialize A_new with Random( dim(s) / dim(s_new) ) = Random( 1/m ) .   (8.36)
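Equations 8.35 and 8.36 translate directly into an initialization routine. The sketch below (the function name and the weight scale are our assumptions, not from the chapter) inflates the state to m · dim(s) and randomly activates, on average, a fraction 1/m of the entries:

```python
import numpy as np

def inflate_transition_matrix(dim_s, m, scale=0.1, rng=None):
    """Sparse random initialization of the inflated transition matrix A_new.

    dim_s : optimal state dimension of the fully connected network
    m     : length of the past unfolding
    Eq. 8.35: dim(s_new) = m * dim_s; eq. 8.36: on average a fraction 1/m
    of the entries is randomly initialized, the remaining weights are zero.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    dim_new = m * dim_s                                  # eq. 8.35
    mask = rng.random((dim_new, dim_new)) < 1.0 / m     # eq. 8.36: Random(1/m)
    weights = rng.normal(0.0, scale, (dim_new, dim_new))
    return np.where(mask, weights, 0.0)
```

For the experiment of section 8.5.2 (dim(s) = 5, m = 3) this yields a 15 × 15 matrix with roughly one third of the entries active.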

training procedure for RNNs

Here Random(·) represents the percentage of randomly initialized weights, whereas the remaining weights are set to zero. Proceeding this way, we replicate on average the computation effect of the former network. At the same time we increase the path lengths (memory) with the sparseness level of the new transition matrix A_new. Note that the sparseness level only depends on the length of the past unfolding m. The conjecture (eq. 8.36) implies that the sparseness of A_new is generated randomly. In section 8.5.3 we present techniques which try to optimize the sparse structure and consequently the memory and computation abilities of the network. Based on our conjectures about inflation, a proper training procedure for recurrent neural networks should consist of four steps: First, one has to set up an appropriate network architecture (e.g., DCNN, eq. 8.23). Second, the length of the past unfolding m and the optimal internal state dimension dim(s) of the system have to be estimated by analyzing the network errors along the time steps of the unfolding (see section 8.2.2). Third, we use the estimated parameters m and dim(s) to determine the optimal dimensionality and sparseness (eqs. 8.35 and 8.36). Fourth, the inflated network is trained until convergence by backpropagation through time using, e.g., the vario-eta learning rule (Neuneier and Zimmermann, 1998).

8.5.2 Experiments: Testing Conjectures About Inflation

In the following experiments we evaluate our conjectures about the optimal inflation of recurrent networks. To ensure a straightforward analysis of our proposed equations (eqs. 8.35 and 8.36), we modeled an artificial network which consists of an autonomous development only. We applied the network to forecast, one time step ahead, the development of the following artificial data generation process:

s_t = tanh(A · s_{t−m}) + ε_t ,   (8.37)

where dim(s) = 5, A is randomly initialized, m = 3, and ε_t is white noise with σ = 0.2. The unfolding in time of the recurrent network includes five time steps from t−3 to t+1. As the data generation process is a closed dynamical system, there are no inputs; instead, time-delayed states s_{t−k} (k = 1, …, m) are used as external influences. Each of the following experiments is based on 100 Monte Carlo simulation runs. For each run we generated 1000 observations, 25% for training and 75% for testing purposes. The network was trained until convergence with error backpropagation through time using vario-eta learning (Neuneier and Zimmermann, 1998). First we evaluated our conjecture about the sparse random initialization of the transition matrix A_new. For this purpose we randomly initialized matrix A_new with different levels of sparseness (100 test runs per sparseness degree). The dimension of the internal state was fixed according to eq. 8.35 at dim(s_new) = 3 · 5 = 15 for all test runs. The mean square error of the network measured on the test set was used as a performance criterion.
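The data generation process of eq. 8.37 is easy to reproduce. A possible sketch (the scale of the random matrix A and the choice of start states are our assumptions; the chapter only specifies the process itself):

```python
import numpy as np

def generate_data(n_obs=1000, dim_s=5, m=3, sigma=0.2, seed=0):
    """Artificial data of eq. 8.37: s_t = tanh(A . s_{t-m}) + eps_t,
    with white noise eps_t ~ N(0, sigma^2). Returns (train, test) split
    25% / 75% as in the experiments of section 8.5.2."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(dim_s), (dim_s, dim_s))  # random init (scaling assumed)
    s = list(rng.normal(0.0, sigma, (m, dim_s)))               # m start states (assumed)
    for _ in range(n_obs):
        s.append(np.tanh(A @ s[-m]) + rng.normal(0.0, sigma, dim_s))
    data = np.array(s[m:])
    n_train = n_obs // 4                # 25% training, 75% testing
    return data[:n_train], data[n_train:]
```

With the defaults this produces 250 training and 750 test observations of a bounded five-dimensional trajectory.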

Figure 8.16 Effects of different degrees of sparseness on matrix A_new: generalization error (×10⁻³) on the test set versus the percentage of randomly initialized weights.

The results of the experiment (fig. 8.16) confirm our conjecture about an optimal sparseness level: if we randomly initialize 35% of the weights of matrix A_new, we observe the best performance (i.e., the lowest average error on the test set). This closely matches equation 8.36, from which the optimal sparseness level computes to 1/m ≈ 33%.


The second series of experiments concerned the optimal internal state dimension. During these experiments we kept the sparseness level constant at 33% (see eq. 8.36), whereas the dimension of the internal state was varied. We performed 100 runs for each dimension of the internal state with different random initializations of matrix A_new. Again we used the network error as an indicator of the model performance. The results are shown in fig. 8.17.

Figure 8.17 Impacts of different internal state dimensions dim(s_new): generalization error (×10⁻³) on the test set versus the dimension of the internal state.

It turns out that the best performance is achieved if the dimension of the internal state is equal to dim(s_new) = 18. Our conjecture of dim(s_new) = 3 · 5 = 15 (eq. 8.35) slightly underestimates the empirically measured optimal dimensionality. However, because of the noise term ε_t, we suppose that the optimal dimension of the underlying system is larger than 5. This indicates that our conjecture in eq. 8.35 is a helpful estimate of the optimal dimensionality. Both experiments show that mismatches between dimensionality and sparseness cause problems in the function (superposition and conservation) of the transition matrix. In other words, an unbalanced parameterization of the inflation leads to lower generalization performance of the network.

8.5.3 Sparseness as a Necessary Condition for Large Systems

One might come up with the idea of initializing a model with a fully connected transition matrix A, and then pruning it during the learning process until a desired degree of sparseness is reached. This approach is misleading, as sparseness is an essential condition for the performance of the backpropagation algorithm in large networks.


Figure 8.18 Forward and backward information flow in the backpropagation algorithm. Forward pass: out_0 = input x, netin_1 = A·out_0, out_1 = f(netin_1), netin_2 = A·out_1, out_2 = f(netin_2). Backward pass: dev_2 = out_2 − target, δ_2 = f′(netin_2)·dev_2, dE_t/dA = δ_2·out_1ᵀ at the output layer; dev_1 = Aᵀ·δ_2, δ_1 = f′(netin_1)·dev_1, dE_t/dA = δ_1·out_0ᵀ at the hidden layer; dev_0 = Aᵀ·δ_1 and δE/δx at the input.

Figure 8.18 shows the forward and backward information flow in the backpropagation algorithm.³ When we look at the calculations, it becomes obvious that if the transition matrix A is fully connected and dim(s) is increasing, we get a growing number as well as a lengthening of the sums in the matrix-vector operations. Due to the law of large numbers the probability of large sum values also increases. This does not pose any problem in the forward flow of the algorithm: the hyperbolic tangent as the nonlinear activation function guarantees that the calculated values stay numerically tractable. In contrast, the backward information flow is linear. In this part of the algorithm large values are spread all over a fully connected matrix A. They quickly sum up to values which cause numerical instabilities and may destroy the whole learning process. This can be avoided if we use a sparse transition matrix A. The number of summands is then smaller and therefore the probability of large sums is low.

In the remainder of this section we want to discuss the question of how to choose a sparse transition matrix A that is still trainable to a stable model. One intuitive answer is to initialize the model several times and then compare the different results. We performed 100 test runs of the neural network using different random initializations of matrix A. The prestructuring of the network followed our conjectures about inflation (eqs. 8.35 and 8.36). An obvious approach to overcome the uncertainty of the random initialization is to pick the best-performing network out of the 100 test runs. However, it is not clear whether 100 test runs are effectively required to find an appropriate model. To study how many test runs are needed to find a good solution with a minimum of computational effort, we picked all possible subsets of k = 1, 2, …, 100 solutions out of the 100 test runs. From each subset we chose the solution with the lowest error on the test set and computed the average performance over all C(100, k) subsets, where C(·, ·) denotes the binomial coefficient:

E_k = (1 / C(100, k)) · Σ_{i_1 < … < i_k} min(e_{i_1}, e_{i_2}, …, e_{i_k}) .   (8.38)

The resulting average error curve for k = 1, 2, …, 100 is depicted in fig. 8.19 (solid line).
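Equation 8.38 need not be evaluated by enumerating all C(100, k) subsets: after sorting the errors, the j-th smallest error (0-based j) is the minimum of exactly C(n−1−j, k−1) of the size-k subsets, which gives a closed form. A sketch with hypothetical error values (function names are ours):

```python
from math import comb
from itertools import combinations

def expected_best_of_k(errors, k):
    """E_k of eq. 8.38: expected minimum error over a random size-k subset
    of the test runs, computed in closed form via order statistics."""
    e = sorted(errors)
    n = len(e)
    # e[j] is the subset minimum in comb(n - 1 - j, k - 1) subsets
    return sum(comb(n - 1 - j, k - 1) * e[j] for j in range(n)) / comb(n, k)

def expected_best_of_k_bruteforce(errors, k):
    """Direct average over all subsets -- only feasible for small n."""
    subs = list(combinations(errors, k))
    return sum(min(s) for s in subs) / len(subs)
```

For k = 1 this reduces to the mean error, and for k = n to the minimum, as expected.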

Figure 8.19 Estimating the number of random initializations for matrix A: average error E_k on the test set (×10⁻³, solid line, left axis) and coverage percentage c(k) (dashed line, right axis) versus the number of test runs.

As can be seen from the error curve in fig. 8.19, an appropriate solution can be obtained on average by choosing the best model out of a subset of 10 networks (vertical dotted line). Of course, the performance is worse than when picking the best model out of all 100 solutions; however, the additional computational effort does not justify the small improvement in performance. As an apparently easier guideline to determine the number of required test runs, we choose the number k such that the so-called coverage percentage,

c(k) = 1 − (1 − 1/m)^k ,   (8.39)

is close to 1. The idea behind the coverage percentage c(k) in eq. 8.39 is that the first inflated network covers c(1) = 1/m of the active elements in the internal transition matrix A_new. Given c(k), the next initialization covers another fraction of the weights in the transition matrix, resulting in c(k+1) = c(k) + (1/m)·(1 − c(k)). The coverage percentage c(k) for the different numbers of initializations is also reported in fig. 8.19 (dashed line). A number of k = 10 random initializations already leads to a coverage of c(10) ≈ 0.983.

pruning & re-creation

To further reduce the computational effort, we developed a more sophisticated approach, a process we call pruning and re-creation of weights. As described in section 8.5.1, we initialize matrix A with a sparseness level of Random(1/m) (eq. 8.36). The idea is now to optimize the initial sparse structure by alternating weight pruning and re-creation. Using this method, matrix A is always sparse and the number of active weights stays constant. The network still gets the opportunity to replace active weights by initially inactive ones that it considers more important for the identification of the dynamics. For the first step, the weight pruning, we use a test criterion similar to optimal brain damage (OBD) (LeCun et al., 1990):

test_w(w ≠ 0) = (∂²E/∂w²) · w² .   (8.40)

We prune a certain percentage (e.g., 5%) of the weights with the lowest test values, as these weights w are assumed to be less important for the identification of the dynamics. To simplify our calculations we use

∂²E/∂w² ≈ (1/T) · Σ_t g_t² ,   (8.41)

with g_t := ∂E_t/∂w, as an approximation for the second derivative. Our simulations showed that this equivalence holds at a 95% level. In the second step, the re-creation of inactive weights, we use the following test:

test_w(w = 0) ∼ (1/T) · Σ_t |g_t| .   (8.42)

We reactivate the weights w with the highest test values. This implies that we recover weights whose average absolute gradient information is high and which are therefore considered important for the identification of the dynamics. Note that we always re-create the same number of weights we pruned in the first step, to keep the sparseness level of the transition matrix A constant. Our experiments showed that we can even prune and re-create weights simultaneously without losing modeling ability.
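One simultaneous pruning and re-creation step could look as follows. This is a schematic sketch, not the authors' procedure: the reactivation value, the 5% fraction, and the assumption that per-pattern gradients are available as an array are all ours; in practice the tests of eqs. 8.40 to 8.42 are evaluated inside the training loop.

```python
import numpy as np

def prune_and_recreate(A, grads, frac=0.05, eps=1e-12):
    """One pruning & re-creation step keeping the number of active weights constant.

    A     : sparse transition matrix (inactive weights are exactly zero)
    grads : per-pattern gradients dE_t/dA, shape (T, n, n)
    Assumes at least n_swap inactive entries exist.
    """
    active = A != 0.0
    n_swap = max(1, int(frac * np.count_nonzero(active)))

    # pruning test, eqs. 8.40/8.41: (1/T) sum_t g_t^2 * w^2 for active weights
    test_prune = np.mean(grads ** 2, axis=0) * A ** 2
    test_prune[~active] = np.inf                  # never rank inactive weights
    prune_idx = np.argsort(test_prune, axis=None)[:n_swap]

    # re-creation test, eq. 8.42: average absolute gradient of inactive weights
    test_create = np.mean(np.abs(grads), axis=0)
    test_create[active] = -np.inf                 # never rank active weights
    create_idx = np.argsort(test_create, axis=None)[-n_swap:]

    A = A.copy()
    A.flat[prune_idx] = 0.0                       # deactivate least important weights
    A.flat[create_idx] = eps                      # reactivate with a tiny value (assumed)
    return A
```

Because pruning ranks only active entries and re-creation only inactive ones, the sparseness level of A is preserved exactly.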

8.5.4 Information Flow in Sparse Recurrent Networks

In small networks with a full transition matrix A the information of a state neuron can reach every other one within one time step. This is different in (large) sparse networks, where state neurons have a longer path length on average. As matrix A is sparse, there is in most cases no direct connection between different state neurons. Hence, it can take several state transitions to transport information from one state neuron to another. As information might not reach a desired neuron in a limited number of time steps, this can be disadvantageous for the modeling ability of the network. The resulting question is how we can speed up the information flow, i.e., shorten the path length. In a simple recurrent network (e.g., eq. 8.9) the transition matrix A is applied once in every state transition. The idea is now to reduce the average path length with at least one additional undershooting step (Zimmermann and Neuneier, 2001). Undershooting means that we implement intermediate states s_{τ±1/2} which improve the computation of the network (fig. 8.20). These intermediate states have no external inputs. Like future unfolding time steps, they are only responsible for the development of the dynamics and therefore also improve numerical stability.

Figure 8.20 Undershooting improves the computation of a sparse matrix A and the numerical stability of the model: intermediate states s_{t±1/2}, s_{t±3/2} are inserted between the regular states of the unfolded network, so the transition matrix A is applied twice per time step.

The following formula gives a rough approximation of how many undershooting steps k are needed (eq. 8.43). As the state transition matrix A in the inflated network has, per definition (eq. 8.36), a sparseness of 1/m, in each time step every state neuron only gets the information of approximately 1/m of the others. The equation now determines the number of undershooting steps which are needed to achieve a desired average path length (information flow) between all state neurons. The kth power of the product of the sparseness factor 1/m and the dimension of the state vector dim(s) must be higher than the number of state neurons (= dim(s)):

( (1/m) · dim(s) )^k ≥ dim(s)
⇒ k ≥ 1 / ( 1 − log(m)/log(dim(s)) )
⇒ k ≥ 1 + log(m)/log(dim(s)) .   (8.43)

undershooting with DCNN
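A quick numeric check of eq. 8.43 (the function name is ours):

```python
from math import ceil, log

def undershooting_steps(m, dim_s):
    """Smallest integer k with k >= 1 + log(m)/log(dim(s)), per eq. 8.43."""
    return ceil(1 + log(m) / log(dim_s))
```

For the inflated network of section 8.5.2 (m = 3, dim(s_new) = 15) this gives k = 2, i.e., a single additional undershooting step per time step suffices.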

Let us reconsider the equations of the DCNN with partially known observables (eq. 8.23):

s_τ = C^E · tanh( A · C · s_{τ−1} + c ) + [0 0 Id]ᵀ · y^E_τ ,
y_τ = [Id 0 0] · s_τ ,
Σ_{t,τ} (y_τ − y^d_τ)² → min over A, c .   (8.44)

Following the principle of undershooting, we add a state s_{τ−1/2} between the states s_{τ−1} and s_τ (fig. 8.21). Consequently, the matrix A is now applied twice between two consecutive time steps, which implies that the information flow is doubled. The consistency matrix C^E handles the lack of external inputs, such that the network stays dynamically consistent.

Figure 8.21 Undershooting doubles the information flow between two successive states by applying the transition matrix twice: an intermediate state s_{t−1/2} is inserted between s_{t−1} and s_t.

It is important to note that the solution is diﬀerent from just decreasing the sparseness of the matrix A. The latter would not only cause numerical problems in the backpropagation algorithm but also disturb the balance between memory and computation.

8.6 Conclusion

In this chapter we focused on dynamical consistent neural networks (DCNNs) for the modeling of open dynamical systems. After a short description of small recurrent networks, including error correction neural networks, we presented a new kind of dynamical consistent neural network. These networks allow an integrated view of particular modeling problems and consequently show better generalization abilities. We concentrated the modeling of the dynamics in one single transition matrix and also enhanced the model from a simple statistical to a dynamically consistent handling of missing input information in the future part. The networks are now able to map integrated system dynamics (e.g., financial markets) instead of only a small set of time series. The final DCNN combines the advantages of the former RNN and the ECNN.

Besides the new, more powerful architectures, the modeling involves a paradigm shift in the analysis of open systems (see fig. 8.1). In the beginning we looked at the description of a dynamical system from an exterior point of view. This means that we observed the information flow into and out of an open system and tried to reconstruct the interior. The long-term predictability of the model finally depended on the quality of the extracted autonomous subsystem. In our new approach we describe dynamical systems from an interior viewpoint. Conceptually we start with a world model (fig. 8.22). Without loss of generality, we assume that our variables of interest are all organized as the first elements in a large state vector. Identifying this first section of the state vector as our observables (y_t), we can reconstruct some more unobservable states (h_t) by their indirect influence on the observables. Nevertheless, there are an infinite number of variables which are unobservable and even unidentifiable. Their nearly infinite influence can be shrunk to a finite-dimensional section in the state vector: the error correction part (e_t). If the error correction is equal to zero, knowledge about the unidentifiable variables is not necessary. Otherwise, it dispenses us from having to know the details of the unknown part of the world. Clearly, the concept of a world model is a closed one from the beginning. As a consequence of dynamical consistency, the closure concept even holds for finite-dimensional subsections of it. Therefore it models the dynamics as a closed system and is still able to keep the model evolution exactly on the observed state trajectory (see DCNN2, eq. 8.22).

Figure 8.22 Variable space of the world model: y_t stands for the observables, h_t for the hidden variables which can be explained by the observables, and e_t for the error corrections, which close the gap between the observable and the unobservable parts of the system.


We augmented the model-building process by incorporating prior knowledge; learning from data is only one part of this process. The recurrent ECNN and the DCNN are two examples of this model-building philosophy. Remarkably, such a joint model-building framework not only provides superior forecasts but also a deeper understanding of the underlying dynamical system. On this basis it is also possible to analyze and to quantify the uncertainty of the predictions, which is especially important for the development of decision support systems. We are currently testing our models in several industrial applications. Further research is being conducted concerning the optimal sparseness of the transition matrix A as well as the optimal initialization method for the state vector.

Acknowledgment We thank J. Zwierz for the calculations and tests during our experiments in section 8.5. The extensive work performed by M. Pellegrino in proofreading this chapter is gratefully acknowledged. The computations were performed with our neural network modeling software SENN (Simulation Environment for Neural Networks), which is a product of Siemens AG.

Notes

1. For other cost functions see Neuneier and Zimmermann (1998).
2. For an overview of algorithmic methods see Pearlmutter (2001) and Medsker and Jain (1999).
3. For further details, the reader is referred to Haykin (1994) and Bishop (1995).

9 Diversity in Communication: From Source Coding to Wireless Networks

Suhas Diggavi

Randomness is an inherent part of network communications. We broadly define diversity as creating multiple independent instantiations (conduits) of randomness for conveying information. In the past few years a trend has been emerging in several areas of communications in which diversity is utilized for reliable transmission and efficiency. In this chapter, we give examples from three topics where diversity is beginning to play an important role.

9.1 Introduction

One of the main characteristics of network communication is uncertainty (randomness): randomness in users' wireless transmission channels, randomness in users' geographical locations in a wireless network, and randomness in route failures and packet losses in networks. The randomness we study in this chapter can have timescales of variation that are comparable to the communication transmission times. This can result in complete failures in communication and therefore affect reliability. Such "nonergodic" losses can be combated if we somehow create independent instantiations of the randomness. We broadly define diversity as the method of conveying information through such multiple independent instantiations. The overarching theme of this chapter is how to create diversity and how we can use it as a tool to enhance performance. We study this idea through diversity in multiple antennas, multiple users, and multiple routes.

The functional modularities and abstractions of the network protocol stack, known as layering (Keshav, 1997), contributed significantly to the success of the wired Internet infrastructure. Layering achieves a form of information hiding, providing only interface information to higher layers, and not the details of the implementation. The physical layer is dedicated to signal transmission, while the data-link layer implements functionalities of data framing, arbitrating access to the transmission medium, and some error control. The network layer abstracts the physical and data-link layers from the upper layers by providing an interface for end-to-end links. Hence, the task of routing and the framing details of the link layer are hidden from the higher layers (transport and application layers). However, as we will see, the use of diversity necessarily causes cross-layer interactions. These cross-layer interactions form a subtext to the theme of this chapter.

Wireless communication hinges on transmitting information riding on radio (electromagnetic) waves, and hence the information undergoes attenuation effects (fading) of radio waves (see section 9.2 for more details). Such multipath fading is a source of randomness. Here diversity arises by utilizing independent realizations of fading in several domains: time (mobility), frequency (delay spread), and space (multiple antennas). Over the past decade research results have shown that multiple-antenna spatial diversity (space-time) communication can not only provide robustness, but also dramatically improve reliable data rates. These ideas are having a huge impact on the design of physical-layer transmission techniques in next-generation wireless systems. Multiple-antenna diversity is the focus of section 9.3.

The wireless communication medium is naturally shared by several users using the same resources. Since the users' locations (and therefore their transmission conditions) are roughly independent, they experience independent randomness in local channel and interference conditions. Diversity in this case arises by utilizing the independent transmission conditions of the different users as conduits for transmitting information, i.e., multi-user diversity. This can be utilized in two ways. The first is to allow users access to resources when it is most advantageous to the overall network; this is a form of opportunistic scheduling and is examined in section 9.4.1. The second is to use the users themselves as relays to transmit information from source to destination; this is a form of opportunistic relaying and is studied in section 9.4.2. These multi-user diversity methods are the focus of section 9.4.

In transmission over networks, random route failures and packet losses degrade performance. Diversity here would be achieved by creating conduits with independent probabilities of route failure. For example, this can be done by transmission over multiple routes with no overlapping links. A fundamental question that arises is how we can best utilize the presence of such route diversity. In order to utilize these conduits, multiple description source coding generates multiple codeword streams to describe a source (such as images, voice, video, etc.). The design goal is to have a graceful degradation in performance (in terms of distortion) when only subsets of the transmitted streams are received. In section 9.5 we study fundamental bounds and design ideas for multiple description source coding.

Therefore, diversity not only plays a role in robustness; it can also result in remarkable gains in achievable performance over several disparate applications. The details of how diversity enhances performance are discussed in the sequel.


Figure 9.1 Radio propagation environment: a mobile node communicating with a base station amid local reflectors.

9.2 Transmission Models

Since a considerable part of this chapter is about wireless communication, it is essential to understand some of the rudiments of wireless channel characteristics. In this section, we focus on models for point-to-point wireless channels and also introduce some of the basic characteristics of transmission over (wireless) networks.

Wireless communication transmits information by riding (modulation) on electromagnetic (radio) waves with a carrier frequency varying from a few hundred megahertz to several gigahertz. Therefore, the behavior of the wireless channel is a function of the radio propagation effects of the environment. A typical outdoor wireless propagation environment is illustrated in fig. 9.1, where the mobile wireless node is communicating with a wireless access point (base station). The signal transmitted from the mobile may reach the access point directly (line-of-sight) or through multiple reflections on local scatterers (buildings, mountains, etc.). As a result, the received signal is affected by multiple random attenuations and delays. Moreover, the mobility of either the nodes or the scattering environment may cause these random fluctuations to vary with time. Time variation results in the random waxing and waning of the transmitted signal strength over time. Finally, a shared wireless environment may incur interference (due to concurrent transmissions from other mobile nodes) to the transmitted signal.

The attenuation incurred by wireless propagation can be decomposed into three main factors: a signal attenuation due to the distance between communicating nodes (path loss), attenuation effects due to absorption in local structures such as buildings (shadowing loss), and rapid signal fluctuations due to constructive and destructive interference of multiple reflected radio wave paths (fading loss). Typically the path-loss attenuation behaves as 1/d^α as a function of distance d, with α ∈ [2, 6]. More detailed models of wireless channels can be found in Jakes (1974) and Rappaport (1996).
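The distance-dependent term alone is easy to quantify. A small helper (ours, with an assumed reference distance d0) expresses the 1/d^α path-loss law in decibels:

```python
from math import log10

def path_loss_db(d, alpha=2.0, d0=1.0):
    """Path-loss attenuation of the 1/d^alpha law, in dB relative to
    a (hypothetical) reference distance d0."""
    return 10.0 * alpha * log10(d / d0)
```

Doubling the distance costs about 6 dB for α = 2 (free-space-like propagation) and about 18 dB for α = 6, which illustrates why the exponent matters so much in dense environments.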

Figure 9.2 MIMO channel model: M_t transmit antennas, M_r receive antennas, and a channel with ν taps.

9.2.1 Point-to-Point Model

For the purposes of this chapter we start with the following model:

y_c(t) = ∫ h_c(t; τ) · s(t − τ) dτ + z(t) ,   (9.1)

where the transmitted signal s(t) = g(t) ∗ x(t) is the convolution of the information-bearing signal x(t) with g(t), the transmission shaping filter, y_c(t) is the continuous-time received signal, h_c(t; τ) is the response at time t of the time-varying channel if an impulse is sent at time t − τ, and z(t) is the additive Gaussian noise. The channel impulse response (CIR) depends on the combination of all three propagation effects and in addition contains the delay induced by the reflections. To collect discrete-time sufficient statistics¹ of the information signal x(t) we need to sample eq. 9.1 faster than the Nyquist rate.² Therefore we focus on the following discrete-time model:

y(k) = y_c(kT_s) = Σ_{l=0}^{ν} h(k; l) · x(k − l) + z(k) ,   (9.2)

where y(k), x(k), and z(k) are the output, input, and noise samples at sampling instant k, respectively, and h(k; l) represents the sampled time-varying channel impulse response of finite length ν. Modeling the channel as having a finite duration can be made arbitrarily accurate by appropriately choosing the channel memory ν. Though the channel response {h(k; l)} depends on all three radio propagation attenuation factors, in the timescales of interest the main variations come from the small-scale fading, which is well modeled as a complex Gaussian random process. Since we are interested in studying multiple-antenna diversity, we need to extend the model given in eq. 9.2 to the multiple transmit (M_t) and receive (M_r) antenna case. The multi-input multi-output (MIMO) model is given by

y(k) = Σ_{l=0}^{ν} H(k; l) · x(k − l) + z(k) ,   (9.3)


Figure 9.3 Block time-invariant model: transmission frames of length T, with channel matrices H^(b) and H^(b+1) for consecutive blocks b and b + 1.

where the Mr × Mt complex3 matrix H(k; l) represents the lth tap of the channel matrix response with x ∈ C Mt as the input and y ∈ C Mr as the output (see ﬁg. 9.2). The variations of the channel response between antennas arises due to variations in arrival directions of the reﬂected radio waves (Raleigh et al., 1994). The input vector may have independent entries to achieve high throughput (e.g., through spatial multiplexing) or correlated entries through coding or ﬁltering to achieve high reliability (better distance properties, higher diversity, spectral shaping, or desirable spatial proﬁle; see section 9.3). Throughout this chapter, the input is assumed to be zero mean and to satisfy an average power constraint, i.e., E[||x(k)||2 ] ≤ P . The vector z ∈ C Mr models the eﬀects of noise and is assumed to be independent of the input and is modeled as a complex additive circularly symmetric Gaussian vector with z ∼ C N (0, Rz ), i.e., a complex Gaussian vector with mean 0 and covariance Rz . In many cases we assume white noise, i.e., Rz = σ 2 I. Finally, the basic point-to-point model given in equation 9.3 can be modiﬁed for an important special case. Many of the insights can be gained for the ﬂat fading channel where we have ν = 0 in equation 9.3. Unless otherwise mentioned, we will use this special case for illustration throughout this chapter. Also we examine the case where we transmit a block or frame of information. Here we encounter another important modeling assumption. If the transmission block is small enough so that the channel time variation within a transmission block can be neglected, we have a block time-invariant model. Such models are quite realistic for transmission blocks of lengths less than a millisecond and typical channel variation bandwidths. However, this does not imply that the channel remains constant during the entire transmission. 
Transmission blocks sent at various periods of time can experience different (independent) channel instantiations (see fig. 9.3). This can be utilized by coding across these different channel instantiations, as will be seen in section 9.3. Therefore, if the transmission block is of length T, for the flat-fading case the specialization of equation 9.3 yields

Y^{(b)} = H^{(b)} X^{(b)} + Z^{(b)},   (9.4)

where Y^{(b)} = [y^{(b)}(0), . . . , y^{(b)}(T − 1)] ∈ C^{M_r × T} is the received sequence, H^{(b)} ∈ C^{M_r × M_t} is the block time-invariant channel fading matrix for transmission block b, X^{(b)} = [x^{(b)}(0), . . . , x^{(b)}(T − 1)] ∈ C^{M_t × T} is the “space-time” information transmission sequence, and Z^{(b)} = [z^{(b)}(0), . . . , z^{(b)}(T − 1)] ∈ C^{M_r × T}.
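The block time-invariant model of equation 9.4 is easy to simulate. Below is a minimal NumPy sketch; the dimensions, seed, and noise level are illustrative assumptions, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_channel_output(H, X, sigma=1.0, rng=rng):
    """Simulate one block time-invariant frame Y = H X + Z (eq. 9.4).

    H: (Mr, Mt) channel matrix for block b; X: (Mt, T) transmitted frame.
    Z has i.i.d. circularly symmetric complex Gaussian entries, Rz = sigma^2 I.
    """
    Mr, T = H.shape[0], X.shape[1]
    Z = (rng.standard_normal((Mr, T)) + 1j * rng.standard_normal((Mr, T))) * (sigma / np.sqrt(2))
    return H @ X + Z

# Illustrative dimensions: Mt = 2 transmit, Mr = 4 receive antennas, frame length T = 8.
Mt, Mr, T = 2, 4, 8
H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
X = (rng.standard_normal((Mt, T)) + 1j * rng.standard_normal((Mt, T))) / np.sqrt(2)
Y = block_channel_output(H, X)
```

Independent draws of H across calls model the independent channel instantiations of different transmission blocks (fig. 9.3).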


Diversity in Communication: From Source Coding to Wireless Networks

Figure 9.4 General multi-user wireless communication network: n nodes with input-output pairs (X_1, Y_1), . . . , (X_n, Y_n).

9.2.2 Network Models

The wireless medium is inherently shared, and this directly motivates a study of multi-user communication techniques. Moreover, since we are also interested in multi-user diversity, we need to extend our model from the point-to-point scenario (eq. 9.2) to the network case. The general communication network (illustrated in fig. 9.4) consists of n nodes trying to communicate with each other. In the scalar flat-fading wireless channel, the received symbol Y_i(t) at the ith node is given by

Y_i(t) = \sum_{\substack{j=1 \\ j \neq i}}^{n} h_{i,j} X_j(t) + Z_i(t),   (9.5)

where h_{i,j} is determined by the channel attenuation between nodes i and j. Given this general model, one way of abstracting the multi-user communication problem is by embedding it in an underlying communication graph G_C, where the n nodes are vertices of the graph and the edges represent a channel connecting two nodes, along with the interference from other nodes. The graph could be directed, with the constraints and channel transition probabilities depending on the directed graph. A general multi-user network is therefore a fully connected graph with the received symbol at each node described by a conditional distribution dependent on the messages transmitted by all other nodes. Such a graph is illustrated in fig. 9.5. We examine different communication topologies in section 9.4 and study the role of diversity in networks.
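The network model of equation 9.5 can be sketched directly in NumPy. The gain matrix, node count, and noise level here are illustrative assumptions; each node receives the superposition of all other nodes' transmissions.

```python
import numpy as np

rng = np.random.default_rng(1)

def network_outputs(h, X, sigma=1.0, rng=rng):
    """Received symbols in an n-node flat-fading network (eq. 9.5).

    h: (n, n) matrix of channel gains h[i, j]; X: length-n vector of
    transmitted symbols. The diagonal is zeroed because node i does not
    receive its own transmission.
    """
    n = len(X)
    h = np.array(h, dtype=complex)
    np.fill_diagonal(h, 0.0)
    Z = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) * (sigma / np.sqrt(2))
    return h @ X + Z

# Illustrative 4-node network with i.i.d. complex Gaussian link gains.
n = 4
h = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
X = rng.standard_normal(n) + 1j * rng.standard_normal(n)
Y = network_outputs(h, X)
```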

Figure 9.5 Graph representation of communication topologies. On the left is a general topology (links h_{i,j} between nodes i and j); on the right is a hierarchical topology in which the nodes communicate through an access point B (links h_{i,B}).

9.3 Multiple-Antenna Diversity

The first form of diversity that we examine in some detail is multiple-antenna diversity. A major development over the past decade has been the emergence of space-time (multiple-antenna) techniques that enable high-rate, reliable communication over fading wireless channels. In this section we highlight some of the theoretical underpinnings of this topic. More details about practical code constructions can be found in Tarokh et al. (1998), Diggavi et al. (2004b), and references therein.

Reliable information transmission over fading channels has a long and rich history; see Ozarow et al. (1994) and references therein. The importance of multiple-antenna diversity was recognized early; see, for example, Brennan (1959). However, most of the focus until the mid-1990s was on receive diversity, where multiple “looks” at the transmitted signal were obtained using many receive antennas (see equation 9.3 with M_t = 1). The use of multiple transmit antennas was restricted to sending the same signal over each antenna, which is a form of repetition coding (Wornell and Trott, 1997). During the mid-1990s several researchers started to investigate the idea of coding across transmit antennas to obtain higher rate and reliability (Foschini, 1996; Tarokh et al., 1998; Telatar, 1999). One focus was on maximizing the reliable transmission rate, i.e., channel capacity, without requiring a bound on the rate at which the error probability diminishes (Foschini, 1996; Telatar, 1999). However, another point of view was explored in which nondegenerate correlation was introduced between the information streams across the multiple transmit antennas in order to guarantee a certain bound on the rate at which the error probability diminishes (Tarokh et al., 1998). These approaches have led to the broad area of space-time codes, which is still an active research topic. In section 9.3.1 we start with an understanding of the reliable transmission rate over multiple-antenna channels. In particular we examine the rate advantages of multiple transmit and receive antennas. Then in section 9.3.2 we introduce the notion of diversity order, which captures transmission reliability (error probability) in the high signal-to-noise ratio (SNR) regime. This allows us to develop criteria for space-time codes which guarantee a given reliability.
Section 9.3.3 examines the fundamental trade-off between maximizing rate and reliability.

9.3.1 Capacity of Multiple-Antenna Channels

The concept of capacity was first introduced by Shannon (1948), where it was shown that even in noisy channels one can transmit information at positive rates with the error probability going to zero asymptotically in the coding block size. The seminal result was that for a noisy channel whose input at time k is {X_k} and output is {Y_k}, there exists a number C such that

C = \lim_{T \to \infty} \frac{1}{T} \sup_{p(x^T)} I(X^T; Y^T),   (9.6)

where the mutual information is given by I(X^T; Y^T) = E_{X^T, Y^T}[\log(p(x^T, y^T) / (p(x^T) p(y^T)))], p(·) is the probability density function, and for convenience we have denoted X^T = {X_1, . . . , X_T} and similarly for Y^T (Cover and Thomas, 1991). In Shannon (1948) it was shown that, asymptotically in block length T, there exist codes which can transmit information at all rates below C with arbitrarily small probability of error over the noisy channel. Perhaps the most famous illustration of this idea was the formula derived in Shannon (1948) for the capacity C of the additive white Gaussian noise channel with noise variance σ² and input power constraint P:

C = \frac{1}{2} \log\Big(1 + \frac{P}{\sigma^2}\Big).   (9.7)

In this section we will focus mostly on the flat-fading channels where, in equation 9.3, we have ν = 0. The generalizations of these ideas for frequency-selective channels (i.e., ν > 0) can be easily carried out (see Biglieri et al., 1998; Diggavi et al., 2004b, and references therein). We begin with the case where we are allowed to develop transmit schemes which code across multiple (B) realizations of the channel matrix {H^{(b)}}_{b=1}^B (see fig. 9.3). In such a case, we can again define a notion of reliable transmission rate, where the error probability decays to zero when we develop codes across asymptotically large numbers of transmit blocks (i.e., B → ∞). We examine this for a coherent receiver, where the receiver uses perfect channel state information {H^{(b)}} for each transmission block. The transmitter, however, is assumed not to have access to the channel realizations. To gain some intuition, consider first the case when each transmission block is large, i.e., T → ∞. If we have one transmit antenna (M_t = 1), the channel response is a vector h^{(b)} ∈ C^{M_r} (see equation 9.4 in section 9.2). Therefore the reliable transmission rate for any particular block can be generalized from equation 9.7 as log(1 + ||h^{(b)}||² P / σ²). Note

that when we are dealing with complex channels (as is usual in communication with in-phase and quadrature-phase transmissions), the factor of 1/2 disappears (Neeser and Massey, 1993) when we adapt the expression from equation 9.7. Now, if one codes across a large number of transmission blocks (B → ∞), for a stationary and ergodic sequence of {h(b) } we would expect to get a reliable transmission rate that is the average of this quantity. This intuition has been made precise in Ozarow et al. (1994), and references therein, for ﬂat-fading channels (ν = 0), even when we do not have T → ∞, but we have B → ∞. Therefore when we have only receive diversity, i.e., Mt = 1, for a given Mr , it is shown (Ozarow et al., 1994) that the


capacity is given by

C = E\Big[\log\Big(1 + \frac{||h||^2 P}{\sigma^2}\Big)\Big],   (9.8)
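The ergodic capacity of equation 9.8 is an expectation over fading states and is easy to estimate by Monte Carlo. The sketch below assumes i.i.d. Rayleigh fading h ∼ CN(0, I_{M_r}), measures capacity in nats, and uses an illustrative SNR of P/σ² = 10; none of these specifics come from the chapter.

```python
import numpy as np

rng = np.random.default_rng(2)

def ergodic_capacity_rx_diversity(Mr, snr, trials=200_000, rng=rng):
    """Monte Carlo estimate of C = E[log(1 + ||h||^2 * SNR)] (eq. 9.8, nats),
    for Mt = 1 transmit antenna and h ~ CN(0, I_Mr) Rayleigh fading."""
    h = (rng.standard_normal((trials, Mr)) + 1j * rng.standard_normal((trials, Mr))) / np.sqrt(2)
    gains = np.sum(np.abs(h) ** 2, axis=1)
    return float(np.mean(np.log(1.0 + gains * snr)))

# More receive antennas give both array gain and diversity, raising the average rate:
c1 = ergodic_capacity_rx_diversity(Mr=1, snr=10.0)
c4 = ergodic_capacity_rx_diversity(Mr=4, snr=10.0)
```

For M_r = 1 the estimate also illustrates Jensen's inequality: the fading capacity lies below the AWGN capacity log(1 + SNR) at the same average received power.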

where the expectation is taken over the fading channel {h^{(b)}} and the channel sequence is assumed to be stationary and ergodic. This is called the ergodic channel capacity (Ozarow et al., 1994). This is the rate at which information can be transmitted if there is no feedback of the channel state ({h^{(b)}}) from the receiver to the transmitter. If there is feedback available about the channel state, one can do slightly better by optimizing the allocation of transmitted power through “waterfilling” over the fading channel states. The problem of studying the capacity of channels with causal transmitter-side information was introduced in Shannon (1958a), where a coding theorem for this problem was proved. Using ideas from there and perfect transmitter channel state information, capacity expressions that generalize equation 9.8 have been developed (Goldsmith and Varaiya, 1997). However, for fast time-varying channels instantaneous feedback could be difficult, resulting in an outdated estimate of the channel being sent back (Caire and Shamai, 1999; Viswanathan, 1999). Indeed, the basic question of the impact of feedback on the capacity of time-varying channels is still not completely understood, and for developing the basic ideas in this chapter, we will deal with the case where the transmitter does not have access to the channel state information. We refer the interested reader to Biglieri et al. (1998) for a more complete overview of such topics. Now let us focus our attention on the multiple transmit and receive antenna channel, where again as before we consider the coherent case, i.e., the receiver has perfect channel state information (CSI) H^{(b)}. In the flat-fading case where ν = 0, when we code across B transmission blocks, the mutual information for this case is

\frac{1}{BT} I(\{X^{(b)}\}_{b=1}^B ; \{Y^{(b)}\}_{b=1}^B, \{H^{(b)}\}_{b=1}^B),

since we assume that the receiver has access to CSI.
Using the chain rule of mutual information (Cover and Thomas, 1991), this can be written as

R^{(B)} = \frac{1}{BT}\Big[ I(\{X^{(b)}\}_{b=1}^B ; \{H^{(b)}\}_{b=1}^B) + I(\{X^{(b)}\}_{b=1}^B ; \{Y^{(b)}\}_{b=1}^B \mid \{H^{(b)}\}_{b=1}^B) \Big].   (9.9)

Using the assumption that the input {x(k)} is independent of the fading process (as the transmitter does not have CSI), equation 9.9 is equal to

R^{(B)} = \frac{1}{BT} E_H\Big[ I(\{X^{(b)}\}_{b=1}^B ; \{Y^{(b)}\}_{b=1}^B \mid \{H^{(b)}\}_{b=1}^B) \Big].   (9.10)

Now, if we use the memoryless property of the vector Gaussian channel obtained by conditioning on H^{(b)}, and also the assumption that {H^{(b)}} is i.i.d. over b, when B → ∞ we get that

\lim_{B \to \infty} \frac{1}{BT} I(\{X^{(b)}\}_{b=1}^B ; \{Y^{(b)}\}_{b=1}^B, \{H^{(b)}\}_{b=1}^B) = E_H\Big[\log \frac{|R_z + H R_x H^*|}{|R_z|}\Big],   (9.11)


where the expectation is taken over the random channel realizations {H^{(b)}}. An operational meaning to this expression can be given by showing that there exist codes which can transmit information at this rate with arbitrarily small probability of error (Telatar, 1999). In general, it is difficult to evaluate equation 9.11 except for some special cases. If the random matrix H^{(b)} consists of zero-mean i.i.d. Gaussian elements, Telatar (1999) showed that

C = E_H\Big[\log\Big|I + \frac{P}{M_t \sigma^2} H H^*\Big|\Big]   (9.12)

is the capacity of the fading matrix channel. Therefore in this case, to achieve capacity the optimal codebook is generated from an i.i.d. Gaussian input {x^{(b)}} with R_x = E[xx^*] = (P/M_t) I. The expression in equation 9.12 shows that the capacity depends on the eigenvalue distribution of the random matrix H with Gaussian i.i.d. components. This important connection between the capacity of multiple-antenna channels and the mathematics related to eigenvalues of random matrices (Edelman, 1989) was noticed in Telatar (1999), where it was shown that the capacity could be numerically computed using Laguerre polynomials (Edelman, 1989; Muirhead, 1982; Telatar, 1999).

Theorem 9.1 (Telatar, 1999) The capacity C of the channel with M_t transmitters and M_r receivers and average power constraint P is given by

C = \int_0^{\infty} \log\Big(1 + \frac{P\lambda}{\sigma^2 M_t}\Big) \sum_{k=0}^{T_{\min}-1} \frac{k!}{(k + T_{\max} - T_{\min})!}\, \lambda^{T_{\max}-T_{\min}} \big[L_k^{T_{\max}-T_{\min}}(\lambda)\big]^2 e^{-\lambda}\, d\lambda ,

where T_max = max(M_t, M_r), T_min = min(M_t, M_r), and L_k^m(·) is the generalized Laguerre polynomial of order k with parameter m (Gradshteyn and Ryzhik, 1994).

In Foschini (1996) it was observed that when M_t = M_r = M the capacity C grows linearly in M as M → ∞.

Theorem 9.2 (Foschini, 1996) For M_t = M_r = M, the capacity C given by equation 9.12 grows asymptotically linearly in M, i.e.,

\lim_{M \to \infty} \frac{C}{M} = c^*(SNR),   (9.13)

where c^*(SNR) is a constant depending on SNR. This quantifies the advantage of using multiple transmit and receive antennas and shows the promise of such architectures for high-rate, reliable wireless communication.
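The linear growth of theorem 9.2 can be observed numerically by estimating equation 9.12 by Monte Carlo and checking that C/M is roughly constant across M. The SNR, trial count, and antenna sizes below are illustrative assumptions (capacity in nats).

```python
import numpy as np

rng = np.random.default_rng(3)

def mimo_capacity(Mt, Mr, snr, trials=2000, rng=rng):
    """Monte Carlo estimate of C = E[log det(I + (SNR/Mt) H H*)] (eq. 9.12, nats),
    with H having i.i.d. CN(0, 1) entries."""
    total = 0.0
    for _ in range(trials):
        H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
        A = np.eye(Mr) + (snr / Mt) * (H @ H.conj().T)
        total += np.linalg.slogdet(A)[1]        # log|A|, numerically stable
    return total / trials

# Per-antenna capacity C/M for square systems M x M: roughly flat in M (theorem 9.2),
# so the total capacity grows ~linearly with the number of antennas.
per_antenna = [mimo_capacity(M, M, snr=10.0) / M for M in (1, 2, 4, 8)]
```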


To achieve the capacity given in equation 9.12, we require joint optimal (maximum-likelihood) decoding of all the receiver elements, which could have large computational complexity. The channel model in equation 9.3 resembles a multiuser channel (Verdu, 1998) with user cooperation. A natural question to ask is whether the simpler decoding schemes proposed in multi-user detection would yield good performance on this channel. A motivation for this is seen by observing that for i.i.d. elements of the channel response matrix (flat fading), the normalized cross-correlation matrix decouples (i.e., \lim_{M_r \to \infty} \frac{1}{M_r} H^* H \to I_{M_t}). Therefore, since nature provides some decoupling, a simple “matched filter” receiver (Verdu, 1998) might perform quite well. In this context, a matched filter for the flat-fading channel in equation 9.3 is given by \tilde{y}(k) = H^*(k) y(k). Therefore, component-wise this means that

Mt

˜i (k), h∗i (k)hj (k)xj (k) + z

i = 1, . . . , Mt .

(9.14)

j=1

j=i
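The matched-filter operation and the decomposition in equation 9.14 can be sketched as follows; the antenna counts, BPSK symbols, seed, and noise level are illustrative assumptions chosen so the desired term ||h_i||² x_i dominates the cross-coupling.

```python
import numpy as np

rng = np.random.default_rng(4)

# Flat-fading channel y = H x + z, then matched filter y_mf = H* y.
Mt, Mr = 2, 64
H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
x = np.array([1.0 + 0j, -1.0 + 0j])          # one BPSK symbol per transmit antenna
z = 0.1 * (rng.standard_normal(Mr) + 1j * rng.standard_normal(Mr))
y = H @ x + z

y_mf = H.conj().T @ y                        # \tilde{y} = H^*(k) y(k)
# Component i of y_mf is ||h_i||^2 x_i plus cross terms and filtered noise (eq. 9.14);
# with many receive antennas the ||h_i||^2 x_i term dominates, so per-stream
# sign detection (treating the cross terms as noise) succeeds.
x_hat = np.where(y_mf.real > 0, 1.0, -1.0)
```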

By ignoring the cross-coupling between the channels, we decode \hat{x}_i by including the “interference” from {x_j}_{j≠i} as part of the noise. However, a tension arises between the decoupling of the channels and the added “interference” \sum_{j \neq i} h_i^*(k) h_j(k) x_j(k)

from the other antennas, which clearly grows with the number of antennas. It is shown in Diggavi (2001) that the two effects exactly cancel each other.

Proposition 9.1 If H(k) = [h_1(k), . . . , h_{M_t}(k)] ∈ C^{M_r × M_t} and h_l(k) ∼ CN(0, I_{M_r}), l = 1, . . . , M_t, are i.i.d., then

\lim_{\substack{M_r \to \infty \\ M_t = \alpha M_r}} \sum_{\substack{j=1 \\ j \neq i}}^{M_t} \Big| \frac{h_i^*(k) h_j(k)}{M_r} \Big|^2 = \alpha \quad \text{almost surely.}
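Proposition 9.1 is easy to probe numerically: for large M_r the normalized cross-interference power concentrates around α. The dimensions and seed below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def cross_interference(Mr, alpha, rng=rng):
    """Monte Carlo check of proposition 9.1: sum_{j != i} |h_i^* h_j / Mr|^2
    for i.i.d. columns h_l ~ CN(0, I_Mr) and Mt = alpha * Mr antennas."""
    Mt = int(alpha * Mr)
    H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
    # Inner products h_j^* h_i / Mr for all j != i (take i = 0); |h_j^* h_i| = |h_i^* h_j|.
    cross = H[:, 1:].conj().T @ H[:, 0] / Mr
    return float(np.sum(np.abs(cross) ** 2))

v = cross_interference(Mr=2000, alpha=0.5)   # should concentrate near alpha = 0.5
```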

Therefore, using this result it can be shown that the simple detector still retains the linear growth rate of the optimal decoding scheme (Diggavi, 2001). However, in the rate R_I achievable for this simple decoding scheme, we do pay a price in terms of rate growth with SNR.

Theorem 9.3 If H_{i,j} ∼ CN(0, 1), with i.i.d. elements, then

\lim_{\substack{M_t \to \infty \\ M_t = \alpha M_r}} \frac{1}{M_t} I(Y, H; X) \ge \lim_{\substack{M_t \to \infty \\ M_t = \alpha M_r}} \frac{R_I}{M_t} = \log\Big(1 + \frac{P/\sigma^2}{1 + \alpha P/\sigma^2}\Big).

Multi-user detection (Verdu, 1998) is a good analogy to understand receiver structures in MIMO systems. The main diﬀerence is that unlike multiple access channels, the space-time encoder allows for cooperation between “users.” Therefore, the encoder could introduce correlations that can simplify the job of the decoder.


Such encoding structures using space-time block codes are discussed further in Diggavi et al. (2004b), and references therein. An example of using the multi-user detection approach is the result in theorem 9.3, where a simple matched filter receiver is applied. Using more sophisticated linear detectors, such as the decorrelating receiver and the MMSE receiver (Verdu, 1998), one can improve performance while still maintaining the linear growth rate. Decision feedback structures, also known as successive interference cancellation or onion peeling (Cover, 1975; Patel and Holtzman, 1994; Wyner, 1974), can be shown to be optimal, i.e., to achieve the capacity, when MMSE multi-user interference suppression is employed and the layers are peeled off (Cioffi et al., 1995; Varanasi and Guess, 1997). However, decision feedback structures inherently suffer from error propagation (which is not taken into account in the theoretical results) and could therefore have poor performance in practice, especially at low SNR. Thus, examining nondecision feedback structures is important in practice. All of the above results illustrate that significant gains in information rate (capacity) are possible using multiple transmit and receive antennas. The intuition for the gains with multiple transmit and receive antennas is that there is a larger number of communication modes over which the information can be transmitted. This is formalized by the observation (Diggavi, 2001; Zheng and Tse, 2002) that the capacity as a function of SNR, C(SNR), grows linearly in min(M_r, M_t), even for a finite number of antennas, asymptotically in the SNR.

Theorem 9.4

\lim_{SNR \to \infty} \frac{C(SNR)}{\log(SNR)} = \min(M_r, M_t).   (9.15)

In the results above, the fundamental assumption was that the receiver had access to perfect channel state information, obtained through training or other methods. When the channel is slowly varying, the estimation error could be small, since we can track the channel variations, and one can quantify the effect of such estimation errors. As a rule of thumb, it is shown by Lapidoth and Shamai (2002) that if the estimation error is small compared to 1/SNR, these results would hold. Another line of work assumes that the receiver does not have any channel state information. The question of the information rate that can be reliably transmitted over the multiple-antenna channel without channel state information was introduced in Hochwald and Marzetta (1999) and has also been examined in Zheng and Tse (2002). The main result from this line of work shows that the capacity growth is again (almost) linear in the number of transmit and receive antennas, as stated formally next.


Theorem 9.5 If the channel is block fading with block length T and we denote K = min(M_t, M_r), then for T > K + M_t, as SNR → ∞, the capacity is

C(SNR) = K\Big(1 - \frac{K}{T}\Big)\log(SNR) + c + o(1),

where c is a constant depending only on M_r, M_t, T.


In fact, Zheng and Tse (2002) go on to show that the rate achievable by using a training-based technique is only a constant factor away from the optimal, i.e., it attains the same capacity-SNR slope as in theorem 9.5. Further results on this topic can be found in Hassibi and Marzetta (2002). Therefore, even in the noncoherent block-fading case, there are significant advantages in using multiple antennas. Most of the discussion above was for the flat-fading case where ν = 0 in equation 9.3. However, these ideas can be easily extended to block time-invariant frequency-selective channels, where again the advantages of multiple-antenna channels can be established (Diggavi, 2001). However, when the channels are not block time-invariant, the characterization of the capacity of frequency-selective channels is an open question.

Outage

In all of the above results, the error probability goes to zero asymptotically in the number of coding blocks, i.e., B → ∞. Therefore, coding is assumed to take place across fading blocks, and hence it inherently uses the ergodicity of the channel variations. This approach would clearly entail large delays, and therefore Ozarow et al. (1994) introduced a notion of outage, where the coding is done (in the extreme case) just across one fading block, i.e., B = 1. Here the transmitter sees only one block of channel coefficients, and therefore the channel is nonergodic and the strict Shannon-sense capacity is zero. However, one can define an outage probability, that is, the probability with which a certain rate R is possible. Therefore, for a block time-invariant channel with a single channel realization H^{(b)} = H, the outage probability can be defined as follows.

Definition 9.1 The outage probability for a transmission rate of R and a given transmission strategy p(X) is defined as

P_{outage}(R, p(X)) = P\big[H : I(X; Y \mid H^{(b)} = H) < R\big].   (9.16)
Therefore, if one uses a white Gaussian codebook (R_x = (P/M_t) I), then (abusing notation by dropping the dependence on p(X)) we can write the outage probability at rate R as

P_{outage}(R) = P\Big[\log\Big|I + \frac{P}{M_t \sigma^2} H H^*\Big| < R\Big].   (9.17)
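The probability in equation 9.17 is straightforward to estimate by Monte Carlo over channel realizations. The sketch below assumes i.i.d. CN(0, 1) fading, a rate measured in nats, and an illustrative SNR and target rate; for the scalar (1 × 1) Rayleigh case it can be checked against the closed form P = 1 − exp(−(e^R − 1)/SNR).

```python
import numpy as np

rng = np.random.default_rng(6)

def outage_probability(Mt, Mr, snr, rate, trials=20_000, rng=rng):
    """Monte Carlo estimate of eq. 9.17:
    P[ log det(I + (SNR/Mt) H H*) < R ], with R in nats."""
    count = 0
    for _ in range(trials):
        H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)
        mutual_info = np.linalg.slogdet(np.eye(Mr) + (snr / Mt) * (H @ H.conj().T))[1]
        count += mutual_info < rate
    return count / trials

# Extra antennas sharply reduce the outage probability at a fixed target rate:
p11 = outage_probability(1, 1, snr=10.0, rate=1.0)
p22 = outage_probability(2, 2, snr=10.0, rate=1.0)
```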


It has been shown (Zheng and Tse, 2003) that at high SNR the outage probability is the same as the frame-error probability in terms of the SNR exponent. Therefore, to evaluate the optimality of practical coding techniques, one can compare, for a given rate, how far the performance of the technique is from that predicted through an outage analysis. Moreover, the frame-error rates and outage capacity comparisons in Tarokh et al. (1998) can also be formally justified through this argument.

9.3.2 Diversity Order

In section 9.3.1 the focus was on achievable transmission rate. A more practical performance criterion is the probability of error. This is particularly important when we are coding over a small number of blocks (low delay), where the Shannon capacity is zero (Ozarow et al., 1994) and we are in the outage regime, as was seen above. By characterizing the error probability, we can also formulate design criteria for space-time codes. Since we are allowed to transmit a coded sequence, we are interested in the probability that an erroneous codeword e is mistaken for the transmitted codeword x. This is called the pairwise error probability (PEP) and is used to bound the error probability. This analysis relies on the condition that the receiver has perfect channel state information. However, a similar analysis can be done when the receiver does not know the channel state information but has statistical knowledge of the channel (Hochwald and Marzetta, 2000). For simplicity, we shall again focus on a flat-fading channel (where ν = 0) in which the channel matrix contains i.i.d. zero-mean Gaussian elements, i.e., H_{i,j} ∼ CN(0, 1). Many of these results can be easily generalized for ν > 0 as well as for correlated fading and other fading distributions. Consider a codeword sequence X = [x^t(0), . . . , x^t(T − 1)]^t, where x(k) = [x_1(k), . . . , x_{M_t}(k)]^t (defined in eq. 9.4). In the case when the receiver has perfect channel state information, we can bound the PEP between two codeword sequences x and e (denoted by P(x → e)) as follows (Guey et al., 1999; Tarokh et al., 1998):

P(x \to e) \le \prod_{n=1}^{M_t} \frac{1}{\big(1 + \frac{E_s}{4 N_0}\lambda_n\big)^{M_r}}.   (9.18)

Here E_s = P/M_t is the power per transmitted symbol, λ_n are the eigenvalues of the matrix A(x, e) = B^*(x, e) B(x, e), and

B(x, e) = \begin{pmatrix} x_1(0) - e_1(0) & \cdots & x_{M_t}(0) - e_{M_t}(0) \\ \vdots & \ddots & \vdots \\ x_1(T-1) - e_1(T-1) & \cdots & x_{M_t}(T-1) - e_{M_t}(T-1) \end{pmatrix}.   (9.19)


If q denotes the rank of A(x, e) (i.e., the number of nonzero eigenvalues), then we can bound equation 9.18 as

P(x \to e) \le \Big(\prod_{n=1}^{q} \lambda_n\Big)^{-M_r} \Big(\frac{E_s}{4 N_0}\Big)^{-q M_r}.   (9.20)

We define the notion of diversity order as follows.

Definition 9.2 A coding scheme which has an average error probability \bar{P}_e(SNR) that behaves as

\lim_{SNR \to \infty} \frac{\log(\bar{P}_e(SNR))}{\log(SNR)} = -d   (9.21)

as a function of SNR is said to have a diversity order of d.

In words, a scheme with diversity order d has an error probability at high SNR behaving as \bar{P}_e(SNR) ≈ SNR^{−d} (see fig. 9.6). One reason to focus on such a behavior for the error probability can be seen from the following intuitive argument for a simple scalar fading channel (M_t = 1 = M_r). It is well known that for a particular frame b, the error probability for binary transmission, conditioned on the channel realization h^{(b)}, is given by P_e(h^{(b)}) = Q(\sqrt{2\,SNR}\, |h^{(b)}|) (Proakis, 1995). Hence if |h^{(b)}|\sqrt{2\,SNR} ≫ 1, then P_e(h^{(b)}) ≈ 0, and if |h^{(b)}|\sqrt{2\,SNR} ≪ 1, then P_e(h^{(b)}) ≈ 1/2. Therefore a frame is in error with high probability when the channel gain |h^{(b)}|² ≪ 1/SNR, i.e., when the channel is in a “deep fade.” Therefore the average error probability is well approximated by the probability that |h^{(b)}|² ≪ 1/SNR. For high SNR we can show that, for h ∼ CN(0, 1), P[|h|² < 1/SNR] ≈ 1/SNR, and this explains the behavior of the average error probability. Although this is a crude analysis, it brings out the most important difference between the additive white Gaussian noise (AWGN) channel and the fading channel. The typical way in which an error occurs in a fading channel is due to channel failure, i.e., when the channel gain |h|² is very small, less than 1/SNR. On the other hand, in an AWGN channel errors occur when the noise is large, and since the noise is Gaussian it has an exponential tail, making this very unlikely at high SNR. Given definition 9.2 of diversity order, we see that the diversity order in equation 9.20 is at most qM_r. Moreover, in inequality 9.20 we notice that we also obtain a coding gain of (\prod_{n=1}^{q} \lambda_n)^{1/q}. Note that in order to obtain the average error probability, one can calculate a naive union bound using the pairwise error probability given in equation 9.20, but this may not be tight. A more careful upper bound for the error probability can be derived (Zheng and Tse, 2003). However, if we ensure that every pair of codewords satisfies the diversity order in equation 9.20, then clearly the average error probability satisfies it as well. This is true when the transmission rate is held constant with respect to SNR, i.e., for a fixed-rate code. Therefore, in the case of fixed-rate code design, the simple pairwise error probability given in equation 9.20 is sufficient to obtain the correct diversity order.
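The deep-fade argument for the scalar Rayleigh channel can be checked numerically: for h ∼ CN(0, 1), |h|² is exponentially distributed, so P(|h|² < 1/SNR) = 1 − e^{−1/SNR} ≈ 1/SNR at high SNR, giving diversity order d = 1. The SNR value and trial count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def deep_fade_probability(snr, trials=500_000, rng=rng):
    """Monte Carlo estimate of P(|h|^2 < 1/SNR) for Rayleigh fading h ~ CN(0, 1).

    |h|^2 is Exp(1), so the exact value is 1 - exp(-1/SNR) ~ 1/SNR at high SNR,
    which matches diversity order d = 1 in definition 9.2."""
    h2 = rng.exponential(1.0, size=trials)
    return float(np.mean(h2 < 1.0 / snr))

p = deep_fade_probability(snr=100.0)   # close to 1 - exp(-0.01), i.e. roughly 1/SNR
```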


Figure 9.6 Relationship between error probability and diversity order: error probability versus SNR (dB) for two diversity orders d_1 and d_2.

In order to design practical codes that achieve a performance target, we need to glean insights from the analysis to state design criteria. For example, in the flat-fading case of equation 9.20 we can state the following rank and determinant design criteria for space-time codes over flat-fading channels (Tarokh et al., 1998):

Rank criterion: In order to achieve maximum diversity M_t M_r, the matrix B(x, e) from equation 9.19 has to be full rank for any codewords x, e. If the minimum rank of B(x, e) over all pairs of distinct codewords is q, then a diversity order of qM_r is achieved.

Determinant criterion: For a given diversity order target of q, maximize (\prod_{n=1}^{q} \lambda_n)^{1/q} over all pairs of distinct codewords.

Over the past few years, there have been significant developments in designing codes which can guarantee a given reliability (error probability). An exhaustive listing of all these developments is beyond the scope of this chapter, but we give a glimpse of the recent developments. The interested reader is referred to Diggavi et al. (2004b), and references therein. Pioneering work on trellis codes for Gaussian channels was done in Ungerboeck (1982). In Tarokh et al. (1998), the first space-time trellis code constructions were presented. In this seminal work, trellis codes were carefully designed to meet the design criteria for minimizing error probability. In parallel, a very simple coding idea for M_t = 2 was developed in Alamouti (1998). This code achieved the maximal diversity order of 2M_r and had a very simple decoder associated with it. The elegance and simplicity of the Alamouti code have made it a candidate for the next generation of wireless systems, which are slated to utilize space-time codes. The basic idea of the Alamouti code was extended to orthogonal designs in Tarokh et al. (1999). The publication of Tarokh et al. (1998) and Alamouti (1998) created a significant community of researchers working on space-time code constructions.
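The rank criterion can be verified mechanically for a small code. The sketch below checks it for the Alamouti code over BPSK (an illustrative alphabet choice): for every pair of distinct codewords the difference matrix is full rank, so the code achieves the maximum diversity order 2M_r.

```python
import numpy as np
from itertools import product

def alamouti_block(s1, s2):
    """2x2 Alamouti codeword: rows are the two time slots, columns the Mt = 2 antennas."""
    return np.array([[s1, s2],
                     [-np.conj(s2), np.conj(s1)]])

# Rank criterion check over all pairs of distinct BPSK codewords: the difference
# matrix B(x, e) has det = |s1 - e1|^2 + |s2 - e2|^2 > 0, hence is always full rank.
bpsk = (1.0, -1.0)
codewords = [alamouti_block(s1, s2) for s1, s2 in product(bpsk, repeat=2)]
min_rank = min(
    np.linalg.matrix_rank(X - E)
    for i, X in enumerate(codewords)
    for j, E in enumerate(codewords)
    if i != j
)
# min_rank == 2: full rank for every pair, i.e. maximum diversity order 2 * Mr.
```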
Over the past few years, there has been signiﬁcant progress in the construction of space-time codes for coherent channels. The design of codes that are linear in the complex ﬁeld was proposed in Hassibi and Hochwald (2002), and eﬃcient decoders for such codes


were given in Damen et al. (2000). Codes based on algebraic rotations and number-theoretic tools are developed in El-Gamal and Damen (2003) and Sethuraman et al. (2003). A common assumption in all these designs was that the receiver had perfect knowledge of the channel. Techniques based on channel estimation, and the evaluation of the resulting degradation in performance for space-time trellis codes, were examined in Naguib et al. (1998). In another line of work, non-coherent space-time codes were proposed in Hochwald and Marzetta (2000). This also led to the design and analysis of differential space-time codes for flat-fading channels (Hochwald and Sweldens, 2000; Hughes, 2000; Tarokh and Jafarkhani, 2000). This was also examined for frequency-selective channels in Diggavi et al. (2002a). As can be seen, the topic of space-time codes is still evolving, and we have given just a snapshot of the recent developments.

9.3.3 Rate-Diversity Tradeoff

A natural question that arises is how many codewords we can have while attaining a certain diversity order. For a flat Rayleigh fading channel, this has been examined (Lu and Kumar, 2003; Tarokh et al., 1998) and the following result was obtained.

Theorem 9.6 If we use a transmit signal with a constellation of size |S| and the diversity order of the system is qM_r, then the rate R that can be achieved is bounded as

R \le (M_t - q + 1) \log_2 |S|   (9.22)

in bits per transmission.

One consequence of this result is that for maximum (M_t M_r) diversity order we can transmit at most log_2 |S| bits/sec/Hz. Note that the trade-off in theorem 9.6 is established with a constraint on the alphabet size of the transmit signal, which may not be fundamental from an information-theoretic point of view. An alternate viewpoint of the rate-diversity trade-off has been explored in Zheng and Tse (2003) from a Shannon-theoretic point of view. In that work the authors are interested in the multiplexing rate of a transmission scheme.

Definition 9.3 A coding scheme which has a transmission rate of R(SNR) as a function of SNR is said to have a multiplexing rate r if

\lim_{SNR \to \infty} \frac{R(SNR)}{\log(SNR)} = r.   (9.23)

Therefore, the system has a rate of r log(SNR) at high SNR. One way to contrast this with the statement in theorem 9.6 is to note that the constellation size is also allowed to become larger with SNR. The naive union bound of the pairwise


Figure 9.7 Rate-diversity trade-off curve: diversity order versus multiplexing rate, piecewise linear through the points (k, (M_t − k)(M_r − k)) from (0, M_t M_r) down to (min(M_t, M_r), 0).

error probability (eq. 9.18) has to be used with care if the constellation size is also increasing with SNR. There is a trade-off between the achievable diversity and the multiplexing gain, and d^*(r) is defined as the supremum of the diversity gain achievable by any scheme with multiplexing gain r. The main result in Zheng and Tse (2003) states the following.

Theorem 9.7 For T > M_t + M_r − 1 and K = min(M_t, M_r), the optimal trade-off curve d^*(r) is given by the piecewise linear function connecting the points (k, d^*(k)), k = 0, . . . , K, where

d^*(k) = (M_r - k)(M_t - k).   (9.24)
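The optimal trade-off curve of theorem 9.7 is simple enough to code up directly; the sketch below linearly interpolates d*(k) = (M_t − k)(M_r − k) between integer multiplexing gains.

```python
def dmt(Mt, Mr, r):
    """Optimal diversity-multiplexing trade-off d*(r) of theorem 9.7:
    piecewise linear interpolation of d*(k) = (Mt - k)(Mr - k), k = 0..min(Mt, Mr)."""
    K = min(Mt, Mr)
    if not 0 <= r <= K:
        raise ValueError("multiplexing gain must lie in [0, min(Mt, Mr)]")
    k = min(int(r), K - 1)                   # index of the segment containing r
    d_k = (Mt - k) * (Mr - k)
    d_k1 = (Mt - k - 1) * (Mr - k - 1)
    return d_k + (r - k) * (d_k1 - d_k)      # linear between (k, d_k) and (k+1, d_k1)

# 2x2 example: full diversity 4 at r = 0, d*(1) = 1, zero diversity at r = 2.
```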

If r = k is an integer, the result can be notionally interpreted as using Mr − k receive antennas and Mt − k transmit antennas to provide diversity while using k antennas to provide the multiplexing gain. However, this interpretation is not physical but really an intuitive explanation of the result in theorem 9.7. Clearly this result means that one can get large rates which grow with SNR if we reduce the diversity order from the maximum achievable. This diversity-multiplexing tradeoﬀ implies that a high multiplexing gain comes at the price of decreased diversity gain and is a manifestation of a corresponding trade-oﬀ between error probability and rate. This trade-oﬀ is depicted in ﬁg. 9.7. Therefore, as illustrated in Theorems 9.6 and 9.7, the trade-oﬀ between diversity and rate is an important consideration both in terms of coding techniques (theorem 9.6) and in terms of Shannon theory (theorem 9.7). A diﬀerent question was proposed in Diggavi et al. (2003, 2004a), where it was asked whether there exists a strategy that combines high-rate communications


with high reliability (diversity). Clearly the overall code will still be governed by the rate-diversity trade-off, but the idea is to ensure the reliability (diversity) of at least part of the total information. This allows a form of communication where the high-rate code opportunistically takes advantage of good channel realizations, whereas the embedded high-diversity code ensures that at least part of the information is received reliably. In this case, the interest is not in a single pair of multiplexing rate and diversity order (r, d), but in a tuple (ra, da, rb, db), where rate ra and diversity order da are ensured for part of the information, with the rate-diversity pair (rb, db) guaranteed for the other part. A class of space-time codes with such desired characteristics has been constructed in Diggavi et al. (2003, 2004a). From an information-theoretic point of view, Diggavi and Tse (2004) focused on the case when there is one degree of freedom (i.e., min(Mt, Mr) = 1). In that case, if we consider da ≥ db without loss of generality, the following result was established (Diggavi and Tse, 2004):

Theorem 9.8 When min(Mt, Mr) = 1, the diversity-multiplexing trade-off curve is successively refinable, i.e., for any multiplexing gains ra and rb such that ra + rb ≤ 1, the diversity orders

$$d_a = d^*(r_a), \qquad d_b = d^*(r_a + r_b), \qquad (9.25)$$

are achievable, where d∗(r) is the optimal diversity order given in theorem 9.7. Since the overall code still has to be governed by the rate-diversity trade-off given in theorem 9.7, it is clear that the trivial outer bound for the problem is da ≤ d∗(ra) and db ≤ d∗(ra + rb). Hence theorem 9.8 shows that the best possible performance can be achieved. This means that for min(Mt, Mr) = 1, we can design ideal opportunistic codes. This new direction of enquiry is currently being explored.
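In the one-degree-of-freedom case of theorem 9.8, the trade-off curve of theorem 9.7 reduces to a single line segment, so the achievable pair (da, db) is simple to compute. A minimal sketch (the function name is ours; we use the fact that for min(Mt, Mr) = 1 the curve connects (0, Mt Mr) to (1, 0), i.e., d∗(r) = Mt Mr (1 − r)):

```python
def refined_diversities(Mt, Mr, ra, rb):
    """Diversity orders promised by theorem 9.8 when min(Mt, Mr) = 1:
    d*(r) is the segment from (0, Mt*Mr) to (1, 0), so d*(r) = Mt*Mr*(1 - r).
    Returns (da, db) = (d*(ra), d*(ra + rb)), with da >= db."""
    assert min(Mt, Mr) == 1 and ra >= 0 and rb >= 0 and ra + rb <= 1
    d_star = lambda r: Mt * Mr * (1 - r)
    return d_star(ra), d_star(ra + rb)

# A 4x1 system: protect the first part of the stream at higher diversity.
print(refined_diversities(4, 1, 0.25, 0.5))  # (3.0, 1.0)
```

The embedded high-diversity part (rate ra) gets da = d∗(ra), while the full stream is still governed by db = d∗(ra + rb), matching the trivial outer bound.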

9.4

Multi-user Diversity

In section 9.3, we explored the importance of using many fading realizations through multiple antennas for reliable, high-rate, single-user wireless communication. In this section we explore another form of diversity, multi-user diversity, in which the different users themselves provide the diversity: each user potentially has independent channel conditions and a local interference environment. This implies that in fig. 9.5, the fading links between users are random and independent of each other. Therefore, this diversity in channel and interference conditions can be exploited by treating the independent links from different users as conduits for information transfer. In order to explore this idea further we first digress to discuss communication topologies. As seen in section 9.2 (see fig. 9.5), we can view the n-user communication network through the underlying graph GC. One topology which is very


commonly seen in practice is obtained by giving special status to one of the nodes as the base station or access point. The other nodes can only communicate to the base station. We call such a topology the hierarchical communication topology (see fig. 9.5). An alternate topology that has emerged more recently is when the nodes organize themselves without a centralized base station. Such a topology is called an ad hoc communication topology, where the nodes relay information from source to destination, typically through multiple "nearest neighbor" communication hops (see also fig. 9.8). In both these topologies there is potential to utilize multi-user diversity, but the methods to do so are distinct. Therefore we explore them separately in sections 9.4.1 and 9.4.2.

9.4.1

Opportunistic Scheduling

In the hierarchical topology, we distinguish between two types of problems: the first is the uplink channel, where the nodes communicate to the access point (many-to-one communication, or the multiple access channel), and the second is the downlink channel, where the access point communicates to the nodes (one-to-many communication, or the broadcast channel). The idea of multi-user diversity can be further motivated by looking at the scalar fading multiple access channel. If the users are distributed across geographical areas, their channel responses will be different depending on their local environments. This is modeled by choosing the users' channels to vary according to distributions that are independent and identical across users. The rate region for the uplink channel in this case was characterized in Knopp and Humblet (1995), where it was shown that in order to maximize the total information capacity (the sum rate), it is optimal for only the user with the best channel to transmit. For the scalar channel, the channel gain determines the best channel. The result (in Knopp and Humblet, 1995), when translated to rapidly fading channels, yields a form of time-division multiple access (TDMA), where the users are not preassigned time slots, but are scheduled according to their respective channel conditions. Even if a particular user is in a deep fade at the current time, there could be another user who has good channel conditions. Hence this strategy is a form of multi-user diversity, where the diversity is viewed across users. Here the multi-user diversity (which arises through independent channel realizations across users) can be harnessed using an appropriate scheduling strategy. If the channels vary rapidly in time, the idea is to schedule users when their channel state is close to the peak rate that it can support. A similar result also holds for the scalar fading broadcast channel (Li and Goldsmith, 2001; Tse, 1997).
Note that this requires feedback from the users to the base station about the channel conditions. The feedback could be just the received SNR. These results are proved on the basis of two assumptions: one is that all the users have identically distributed (i.e., symmetric) channels, and the other is that we are interested in long-term rates. We focus on the first assumption, and later briefly return to the question of delay. In wireless networks, the users' channels are almost never symmetric. Nodes that

9.4

Multi-user Diversity

263

are closer to the base station experience much better channels on average than nodes that are farther away (due to path loss; see section 9.2). Therefore, using a TDMA technique that allows exclusive use of the channel by the best user would be inherently unfair to users who are farther away. Suppose the long-term average rates {Tk} are to be provided to the users. The criterion used in the result in Knopp and Humblet (1995) was the sum throughput of all the users, i.e., max Σk Tk. This criterion can be maximized by only scheduling the nodes with strong channels, and this could be an unfair allocation of resources across users. In order to translate the intuition about multi-user diversity into practice, one would need to ensure fairness among users. The idea in Bender et al. (2000), Jalali et al. (2000), and Chaponniere et al. is to use a proportionally fair criterion for scheduling, which maximizes Σk log(Tk). This idea is inherently used in the downlink scheduling algorithm of IS-856 (Bender et al., 2000; Chaponniere et al.; Jalali et al., 2000), also known as the high data rate (HDR) 1xEV-DO system. The scheduling algorithm implemented in the 1xEV-DO system keeps track of the average throughput Tk(t) of user k in a past window of length tc. Let the rate that can be supported to user k at time t be denoted by Rk(t). At time t, the scheduling algorithm transmits to the user with the largest ratio Rk(t)/Tk(t) among the active users. The average throughputs are then updated given the current allocation. Since this idea ensures fairness while utilizing multi-user diversity, it is an instantiation of an opportunistic scheduler. The scheduling algorithm described above relies on the rates supported by the users varying rapidly in time. But this assumption can be violated when the channels are constant or very slowly time-varying.
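A minimal sketch of such a proportional-fair scheduler follows. This is our own simplified model, not the IS-856 implementation: the averaging window of length tc is approximated by an exponentially weighted average, and the supportable rates Rk(t) are drawn at random.

```python
import random

def proportional_fair_schedule(rates, T, tc=100.0):
    """One scheduling slot: pick the user maximizing R_k(t) / T_k(t),
    then update every user's average throughput over a window ~tc slots.
    `rates` are the currently supportable rates R_k(t); `T` the averages."""
    k_star = max(range(len(rates)), key=lambda k: rates[k] / T[k])
    for k in range(len(rates)):
        served = rates[k] if k == k_star else 0.0
        T[k] = (1.0 - 1.0 / tc) * T[k] + (1.0 / tc) * served
    return k_star

# Two users: user 0 has a strong channel on average, user 1 a weak one.
random.seed(1)
T = [1e-3, 1e-3]                      # small positive initial averages
served = [0, 0]
for t in range(5000):
    rates = [random.expovariate(1.0), 0.3 * random.expovariate(1.0)]
    served[proportional_fair_schedule(rates, T)] += 1
print(served)  # both users are scheduled a nontrivial fraction of the time
```

Because the metric Rk(t)/Tk(t) normalizes each user by its own average, the weak user is still scheduled when its channel is near its own peak, which is exactly the fairness-plus-opportunism behavior described above.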
In order to artificially induce time variations, Viswanath et al. (2002) propose to use multiple transmit antennas and introduce random phase rotations between the antennas to simulate fast fading. This idea of phase sweeping across multiple antennas has also been proposed in Weerackody (1993) and Hiroike et al. (1992) in the context of creating time diversity in single-user systems. With such artificially induced fast channel variations, the same scheduling algorithm used in IS-856 (outlined above) inherently captures the multi-user spatial diversity of the network. In Viswanath et al. (2002), this technique is shown to achieve the maximal diversity order (see section 9.3.2) for each user, asymptotically in the number of (uniformly distributed) users. In a heavily loaded system with a uniform distribution of users, the technique proposed in Viswanath et al. (2002) is attractive. However, for lightly loaded systems, or when delay is an important QoS criterion, its desirability is less clear. Given that the technique proposed in Viswanath et al. (2002) is based on a rate-based QoS criterion, it cannot provide delay guarantees for the jobs of different users. This motivates the discussion of scheduling algorithms for job-based QoS criteria. In job-based criteria, requests are assumed to arrive at certain times ai, and we have information about their sizes si (say, in bytes). Response time is defined to be ci − ai, where ci is the time when a request was fully serviced and ai is the arrival time of the request. This is a standard QoS criterion for a request.

Relative response is defined as (ci − ai)/si (Bender et al., 1998). Relative response was proposed in the context of heterogeneous workloads, such as the Web, i.e., requests for data of different sizes (thus, different si). The above criteria relate to guarantees per request; we could also give guarantees over all requests. For example, the overall performance criterion for a set of jobs could be the l∞ norm, namely, maxi (ci − ai) (i.e., max response time) or maxi (ci − ai)/si (i.e., max relative response). Other criteria based on the average instead of the maximum are also studied. The new generation of wireless networks can support multiple transmission rates depending on the channel conditions. Assuming an accurate communication-theoretic model for the physical-layer achievable rates (as described in section 9.3), job-scheduling algorithms are proposed and analyzed for various QoS criteria in Becchetti et al. (2002). These algorithms utilize the diverse job requirements of the users to provide provable guarantees in terms of the job-scheduling criteria. These discussions just illustrate how multi-user diversity can be utilized in hierarchical networks. This form of opportunistic scheduling is an important part of the new generation of wireless data networks.
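The per-request and worst-case (l∞) criteria above can be computed directly from the arrival, completion, and size data. A small sketch (the job list and function name are hypothetical, chosen only to illustrate the definitions):

```python
def qos_metrics(jobs):
    """jobs: list of (a_i, c_i, s_i) = (arrival, completion, size in bytes).
    Returns the l-infinity criteria: max response time max_i (c_i - a_i)
    and max relative response max_i (c_i - a_i) / s_i."""
    responses = [c - a for a, c, s in jobs]
    relative = [(c - a) / s for a, c, s in jobs]
    return max(responses), max(relative)

# Three requests: a large one served slowly, two small ones served quickly.
jobs = [(0.0, 10.0, 1000), (1.0, 3.0, 10), (2.0, 2.5, 10)]
max_resp, max_rel = qos_metrics(jobs)
print(max_resp)  # 10.0 (the large request dominates response time)
print(max_rel)   # 0.2  (a small request waiting 2 s dominates relative response)
```

The example shows why the two criteria can disagree: the large request has the worst absolute response time, but a small request that waited long relative to its size has the worst relative response.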

9.4.2

Mobile Ad Hoc Networks

In an ad hoc communication topology (network), one need not transmit information directly from source to destination, but can instead use other users, which act as relays, to help convey information to its ultimate destination. Such multihop wireless networks have a rich history (see, for example, Hou and Li, 1986, and references therein). In an important step toward systematically understanding the capacity of wireless networks, Gupta and Kumar (2000) explored the behavior of wireless networks asymptotically in the number of users. In their setup, n nodes were placed independently and randomly at locations {Si} in a finite geographical area (a scaled unit disk). Also, m = Θ(n) source and destination (S-D) pairs {(Si, Ti)} are randomly chosen, as shown in fig. 9.8. The model assumes that each source Si has an infinite stream of (information) packets to send to its respective destination Ti. The nodes are allowed to use any scheduling and relaying strategy through other nodes to send the packets from the sources to the destinations (see fig. 9.8). The goal is to analyze the best possible long-term throughput per S-D pair asymptotically in the number of nodes n. In Gupta and Kumar (2000), a single-user communication model was used where each node transmitted information to its intended receiver (relay or destination node), and the receiver treated the interference from other nodes as part of the noise. Therefore, in this communication model, a successful transmission of rate R occurred when the signal-to-interference-plus-noise ratio (SINR) was above a certain threshold β. Clearly, such a communication model can be improved by attempting to decode the "interference" from other nodes using sophisticated multi-user decoding (Verdu, 1998). But such a decoding strategy was not considered by Gupta and Kumar (2000), and therefore this need not be an information-


Figure 9.8 Routes from sources {Si}, denoted by filled circles, to destinations {Ti}, denoted by shaded circles.

theoretically optimal strategy. In order to represent wireless signal transmission, the signal strength variation was modeled only through path loss (see section 9.2) with exponent α. Therefore, if {Pi} are the powers at which the various nodes transmitted, then the SINR from node i to node j is defined as

$$\mathrm{SINR} = \frac{P_i\,|S_i - S_j|^{-\alpha}}{\sigma^2 + \sum_{k \in I,\, k \neq i} P_k\,|S_k - S_j|^{-\alpha}}\,, \qquad (9.26)$$
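The SINR expression of eq. 9.26 translates directly into code. A minimal sketch (the node positions, powers, noise level, and path-loss exponent are illustrative values of our choosing, not from the text):

```python
def sinr(i, j, positions, powers, transmitting, alpha=4.0, noise=1e-9):
    """SINR at node j for a transmission from node i (eq. 9.26):
    received powers decay with distance as 1/d^alpha, and every other
    simultaneously transmitting node contributes interference."""
    def dist(a, b):
        return ((positions[a][0] - positions[b][0]) ** 2 +
                (positions[a][1] - positions[b][1]) ** 2) ** 0.5
    signal = powers[i] / dist(i, j) ** alpha
    interference = sum(powers[k] / dist(k, j) ** alpha
                       for k in transmitting if k != i)
    return signal / (noise + interference)

positions = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.5)]  # node 2 is a distant interferer
powers = [1.0, 1.0, 1.0]
# Node 0 transmits to its nearby neighbor node 1 while node 2 also transmits.
print(sinr(0, 1, positions, powers, transmitting={0, 2}))
```

In the Gupta-Kumar model, the transmission succeeds at rate R exactly when this quantity exceeds the threshold β, which is why scheduling many short (nearest-neighbor) transmissions is advantageous: the 1/d^α decay strongly favors short links.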

where I is the subset of users simultaneously transmitting at some time instant. Next, we need to define the notion of throughput per S-D pair more precisely.

Definition 9.4 For a scheduling and relay policy π, let Miπ(t) be the number of packets from source node Si to its destination node Ti successfully delivered at time t. A long-term throughput λ̃(n) is feasible if there exists a policy π such that for every source-destination pair

$$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} M_i^{\pi}(t) \;\geq\; \tilde{\lambda}(n). \qquad (9.27)$$

We define the throughput λ(n) as the highest achievable λ̃(n). Note that λ(n) is a random quantity which depends on the node locations of the users. Our interest is in the scaling law governing λ(n), i.e., the behavior of λ(n) asymptotically in n. One of the main results of Gupta and Kumar (2000) was the following.

Theorem 9.9 There exist constants c1 and c2 such that

$$\lim_{n \to \infty} P\left\{\lambda(n) = \frac{c_1 R}{\sqrt{n \log n}} \text{ is feasible}\right\} = 1,$$

$$\lim_{n \to \infty} P\left\{\lambda(n) = \frac{c_2 R}{\sqrt{n}} \text{ is feasible}\right\} = 0\,.$$


Therefore, the long-term per-user throughput decays as O(1/√n), showing that high per-user throughput may be difficult to attain in large-scale (fixed) wireless networks. This result has been recently strengthened: it was shown by Franceschetti et al. (2004) that λ(n) = Θ(1/√n). One way to interpret this result is the following. If n nodes are randomly placed in a unit disk, nearest neighbors (with high probability) are at a distance O(1/√n) apart. Gupta and Kumar (2000) show that it is important to schedule a large number of simultaneous short transmissions, i.e., between nearest neighbors. If randomly chosen source-destination pairs are O(1) distance apart and we can only schedule nearest-neighbor transmissions, information has to travel O(√n) hops to reach its destination. Since there can be at most O(n) simultaneous transmissions at a given time instant, this imposes a O(1/√n) upper bound on such a strategy. This is an intuitive argument; a rigorous proof of theorem 9.9 is given in Gupta and Kumar (2000), among other interesting results. Note that the coding strategy in theorem 9.9 was simple and the interference was treated as part of the noise. An open question concerns the throughput when sophisticated multi-user coding and decoding are used. For such an information-theoretic characterization, understanding the rate region of the relay channel is an important component (Cover and Thomas, 1991). The relay channel was introduced in van der Meulen (1977), and the rate region for special cases was presented in Cover and El Gamal (1979). Recently, Leveque and Telatar (2005), Xie and Kumar (2004), and Gupta and Kumar (2003) have established that even with network information-theoretic coding strategies, the per S-D pair throughput scaling law decays with the number of users n. A natural question that arises is whether there is any mechanism by which one can improve the scaling law for throughput in wireless networks.
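The O(1/√n) nearest-neighbor spacing underlying this interpretation is easy to check by simulation. A brute-force Monte Carlo sketch (our own illustration; the sample sizes are arbitrary):

```python
import math, random

def mean_nearest_neighbor(n, rng):
    """Mean nearest-neighbor distance among n points placed uniformly
    at random in the unit disk (brute force, O(n^2))."""
    pts = []
    while len(pts) < n:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:          # rejection sampling into the disk
            pts.append((x, y))
    total = 0.0
    for i, (xi, yi) in enumerate(pts):
        total += min(math.hypot(xi - xj, yi - yj)
                     for j, (xj, yj) in enumerate(pts) if j != i)
    return total / n

rng = random.Random(0)
for n in (100, 400, 1600):
    # quadrupling n should roughly halve the nearest-neighbor distance
    print(n, mean_nearest_neighbor(n, rng))
```

The roughly 2x shrinkage per 4x increase in n is the O(1/√n) spacing; combined with O(1)-separated source-destination pairs, it forces the O(√n) hop counts used in the argument above.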
Mobility was one such mechanism, examined in Grossglauser and Tse (2002). In the model studied, random node mobility was allowed and the locations {Si(t)} vary in a uniform, stationary, and ergodic manner over the entire disk (see fig. 9.9). In the presence of such symmetric (among users) and "space-filling" mobility patterns, the following surprising result was established in Grossglauser and Tse (2002).

Theorem 9.10 There exists a scheduling and relaying policy π and a constant c > 0 such that

$$\lim_{n \to \infty} P\{\lambda(n) = cR \text{ is feasible}\} = 1\,. \qquad (9.28)$$

Therefore, node mobility allows us to achieve a per-user throughput of Θ(1). The main reason this was attainable was that packets are relayed only through a finite number of hops by utilizing node mobility. Thus, a node carries packets over an O(1) distance before relaying them, and therefore Grossglauser and Tse (2002) show that, with high probability, if the mobility patterns are space-filling, the number of hops needed from source to destination is bounded, instead of growing as O(√n) in


Figure 9.9 Mobility in ad hoc networks. The figure on the left shows a space-filling mobility model where the nodes uniformly cover the region. The figure on the right shows a limited one-dimensional mobility model where nodes move along fixed line segments.

the case of fixed (nonmobile) wireless networks (Gupta and Kumar, 2000). However, the above mobility model is a generous one, since (1) it is homogeneous, i.e., every node has the same mobility process, and (2) the sample path of each node "fills the space over time." This means that there is a nonzero probability that the node visits every part of the geographical region. A natural question is whether the throughput result in Grossglauser and Tse (2002) depends strongly on these two features of the mobility model. In Diggavi et al. (2002b), a different mobility model is introduced which embodies two salient features that many real mobility processes seem to possess (e.g., cars traveling on roads, people walking in buildings or cities, trains, satellites circling the earth), which are not captured by the model in Grossglauser and Tse (2002). First, an individual node typically visits only a small portion of the entire space, and rarely leaves this preferred region. Second, the nodes do move frequently within their preferred regions, and an individual node often covers a large distance within its region. As an extreme abstraction of such mobility processes, Diggavi et al. (2002b) studied mobility patterns where nodes move along a given set of one-dimensional paths (see fig. 9.9). In particular, the mobility patterns were restricted to random line segments and, once chosen, the configuration of line segments is fixed for all time. Therefore, given the configuration, the only randomness arose through user mobility along these line segments. In order to isolate the effects of one-dimensional mobility from edge effects, Diggavi et al. (2002b) studied a model in which the nodes are on a unit sphere but each node is constrained to move on a one-dimensional great circle. Therefore, a configuration in this case was a set of one-dimensional paths (great circles) which were fixed throughout the communication period, and the nodes moved randomly only along these paths. Thus, the homogeneity assumption


in Grossglauser and Tse (2002) is now relaxed. In particular, there can be pairs of nodes that are far more likely to be in close proximity to each other than other pairs. For example, if two one-dimensional paths nearly overlap, the probability of close encounter between the nodes is signiﬁcantly larger than for two paths that are “far apart.” This lack of homogeneity implies, as shown in Diggavi et al. (2002b), that there are conﬁgurations where constant throughput is unattainable even with mobility. Since the capacity of such a mobile ad hoc network then depends on the constellation of one-dimensional paths, the question becomes one of scaling laws for a random conﬁguration. Therefore, the conﬁgurations themselves are chosen randomly with each one-dimensional path (great circle) chosen independently and with an identical uniform distribution. Given such a random conﬁguration, the question then becomes whether “bad” conﬁgurations (where the per S-D pair throughput is not Θ(1)) occur often. One of the key ideas in Diggavi et al. (2002b) was the identiﬁcation and proof of typical (“good”) conﬁgurations, on which the average long-term throughput per node is Θ(1). Intuitively the typical conﬁgurations deﬁned in Diggavi et al. (2002b) are those where the fraction of one-dimensional paths intersecting any given area is uniformly close to its expected number. That is, the empirical probability counts are uniformly close to the underlying probability of a random one-dimensional path intersecting that area. Therefore, even for a particular deterministically chosen conﬁguration which satisﬁes the typicality condition, the per S-D pair throughput is Θ(1). One of the main results in Diggavi et al. (2002b) is that if the one-dimensional paths are chosen (uniformly) randomly and independently, then for almost all constellations of such paths, the throughput per S-D pair is Θ(1). 
Therefore, for random configurations the probability of an atypical configuration is shown to go to zero asymptotically in the network size n. Thus, although each node is restricted to move in a one-dimensional space, the same asymptotic performance is achieved as in the case when the nodes can move in the entire two-dimensional region.

Theorem 9.11 Given a configuration C, there exists a scheduling and relaying policy π and a constant c > 0 such that

$$\lim_{n \to \infty} P\{\lambda(n) = cR \text{ is feasible} \mid C\} = 1 \qquad (9.29)$$

for almost all conﬁgurations C as n → ∞, i.e., the probability of the set of conﬁgurations for which the policy achieves a throughput of λ goes to 1 as n → ∞. Next we give a ﬂavor of the proof techniques used to prove theorem 9.11. First, we examine a relaying strategy where at each time, every node carries source packets, which originate from that node, and relay packets, which originated from other nodes and are to be forwarded to their ﬁnal destinations. In phase I, each sender attempts to transmit a source packet to its nearest receiver, who


Figure 9.10 The relaying strategy for mobile nodes. In phase I, the source attempts to transfer packets to relays. During phase II, the relays attempt to transfer packets to the destination.

will serve as a relay for that packet. In phase II, each sender identifies its nearest receiver and attempts to transmit a relay packet destined for it, if the sender has one (see fig. 9.10). As in equation 9.26, a successful transmission of rate R occurs when the signal-to-interference-plus-noise ratio (SINR) is above a certain threshold β. Note that it can be shown that if a source node attempts to "wait" until it encounters its destination, the per S-D pair throughput cannot be Θ(1). Therefore every source spreads its traffic to random intermediate nodes depending on the mobility. Moreover, each packet is forwarded successfully to only one relay, i.e., there is no duplication. Mobility allows source-destination pairs to relay information through several independent relay paths, since nodes have changing nearest neighbors due to mobility. This method of relaying information through independent attenuation links which vary over time is also a form of multi-user diversity. One can see this by observing that the transmission occurs over several realizations of the communication graph GC. The relaying strategy which utilizes mobility schedules transmissions over appropriate realizations of the graph. Conceptually, this use of independent relays to transmit information from source to destination is illustrated in fig. 9.11, where the strategy of theorems 9.10 and 9.11 is used. Intuitively, if the source is able to uniformly spread its traffic through each of its relays (see fig. 9.11), then we can expect to obtain Θ(1) throughput per S-D pair. In order for this to occur, we need to show two properties:

1. Every node spends the same order of time as the nearest neighbor of Θ(n) other nodes. This ensures that each source can spread its packets uniformly across Θ(n) other nodes, all acting as relays, and these packets can in turn be merged back into their respective final destinations.


2. When communicating with the nearest neighbor receiver, the capture probability is not vanishingly small in a large system, even though there are Θ(n) interfering nodes transmitting simultaneously.
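A toy discrete-time sketch of the two-phase relaying strategy follows. This is our own abstraction, not the actual protocol: "meeting" a node is modeled as a uniform random draw standing in for the changing nearest neighbor, and capture is assumed to always succeed.

```python
import random

def simulate_two_phase(n_relays, n_packets, steps, rng):
    """Toy model of the two-phase strategy: in each slot the source meets
    one uniformly random node (phase I hand-off) and the destination meets
    another (phase II delivery). Each packet goes to exactly one relay
    (no duplication); relays buffer packets until they meet the destination."""
    relay_queues = [0] * n_relays   # packets currently buffered at each relay
    to_send, delivered = n_packets, 0
    for _ in range(steps):
        # Phase I: source hands its next packet to the node it meets.
        if to_send > 0:
            relay_queues[rng.randrange(n_relays)] += 1
            to_send -= 1
        # Phase II: the node meeting the destination forwards one packet.
        r = rng.randrange(n_relays)
        if relay_queues[r] > 0:
            relay_queues[r] -= 1
            delivered += 1
    return delivered

rng = random.Random(0)
print(simulate_two_phase(n_relays=50, n_packets=500, steps=5000, rng=rng))
```

Even in this crude model, the source's traffic spreads roughly uniformly over the relays (property 1 above), and nearly all packets eventually reach the destination in two hops each; what the model hides is the delay, which depends on how often relays and destination meet.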

throughput-delay trade-oﬀ

However, with one-dimensional mobility, it is shown in Diggavi et al. (2002b) that there exist configurations where these properties cannot be satisfied. This is where the identification of typical configurations becomes important. For typical configurations, it is shown in Diggavi et al. (2002b), through a detailed technical argument, that these properties hold. Moreover, for randomly chosen configurations, it is shown that such typical configurations occur with probability going to 1 asymptotically in n. Therefore, using these components, the proof of theorem 9.11 is completed. There is a dramatic gain in the per S-D pair throughput in theorems 9.10 and 9.11 over theorem 9.9, from O(1/√n) to Θ(1). A natural question to ask is whether there is a cost to this improvement. The results in theorems 9.10 and 9.11 utilized node mobility to deliver the information from source to destination. Therefore, the timescale over which this is effective depends on the velocity of the nodes, which determines the rate of change of the topology. Hence we can expect significantly larger packet delays for this scheme as compared to the fixed network. In some sense, the Gupta-Kumar result in theorem 9.9 has a smaller throughput, but also a smaller packet delay, since the delays depend on successful packet transmissions over the route and not on the change in node topology. Hence a natural question to ask is whether there exists a fundamental trade-off between delay and throughput in ad hoc networks. This question was recently studied in El Gamal et al. (2004), where the authors quantified this trade-off. In order to quantify the trade-off there needs to be a formal definition of delay. In El Gamal et al. (2004), delay D(n) is defined as the sum of the times spent in every relay node. This definition does not include the queueing delay at the nodes, just the delay incurred in successful transmission of the packet on each single hop of the route.
Given this definition of delay, El Gamal et al. (2004) established that for a fixed random network of n nodes, when λ(n) = O(1/√(n log n)), the delay-throughput trade-off is D(n) = Θ(nλ(n)). For a mobile ad hoc network with λ(n) = Θ(1), El Gamal et al. (2004) showed that D(n) = Θ(√n/v(n)), where v(n) is the velocity of the mobile nodes. Therefore, this quantifies the cost of higher throughput in mobile networks. The theoretical developments in sections 9.4.1 and 9.4.2 indicate the strong interactions between the physical-layer coding schemes and channel conditions on the one hand, and the networking issues of resource allocation and application design on the other. This is an important insight we can draw for the design of wireless networks. Therefore, several problems which are traditionally considered as networking issues, and are typically designed independently of the transmission techniques, need to be reexamined in the context of wireless networks. As illustrated, diversity needs to be taken into account while solving these problems. Such an integrated approach is a major lesson learned from the theoretical considerations, and we develop another aspect of this through the study of source coding using route diversity in section 9.5.


Figure 9.11 Multi-user diversity through relays. The source reaches the destination both through a direct path and through multiple routes via relay nodes, using phase I hand-offs and phase II deliveries.

9.5

Route Diversity

The interest in section 9.4.2 was the characterization of long-term throughput from source to destination. However, in applications such as sensor networks (see, for example, Pottie and Kaiser, 2000; Pradhan et al., 2002, and references therein), node failures can lead to routes being disconnected during a transmission period. This becomes particularly crucial when there are strong delay constraints, such as those in real-time data delivery. Such route failures can also occur in ad hoc networks (discussed in section 9.4.2) as well as in wired networks. In multihop relay strategies, we could utilize the existence of multiple routes from source to destination in order to increase the probability of successfully receiving the information at the destination within delay constraints, despite route (path) failures. This is a form of route diversity (see fig. 9.12), and was first suggested by Maxemchuk (1975) in the context of wired networks. Note that in a broad sense, the multi-user diversity studied in mobile ad hoc networks in section 9.4.2 also utilizes the presence of multiple routes from source to destination. However, in that case the multiple routes were utilized to increase the long-term per S-D pair throughput; in this section we will utilize the multiple routes for low-delay applications. We will examine this problem in the context of delivering a real-time source (like speech, images, or video) with tight delay constraints. If the same information about the source is transmitted over both routes, then this is a form of repetition coding; however, when both routes are successful, there is no performance advantage. A more sophisticated technique would be to send correlated descriptions of the source over the two routes, such that each description is individually good, but the descriptions differ from one another, so that if both routes are successful one gets a better approximation of the source.
This is the basic idea behind multiple description (MD) source coding (El Gamal and Cover, 1982). This notion can be

272

Diversity in Communication: From Source Coding to Wireless Networks

Figure 9.12 Route diversity. A source encoder produces two descriptions of the source sequence, sent over route 1 and route 2 to the destination.

extended to more than two descriptions as well, but in this section we will focus on the two-description case for simplicity. The idea is that the source is coded through several descriptions, where we require that performance (distortion) guarantees can be given for any subset of the descriptions, and the descriptions mutually refine each other. This is the topic discussed in sections 9.5.1 and 9.5.2. In a packet-based network such as the Internet, packet losses are inevitable due to congestion or transmission errors. If the data does not have stringent delay constraints, error recovery methods typically ensure reliability either through a repeat request protocol or through forward error correction (Keshav, 1997). Another technique is through scalable (or layered) coding, which sends a lower-rate base layer, or coarser description of the source, and sends refinement layers to enhance the description. Such a technique is again dependent on reliable delivery of the base layer: if the base layer is lost, the enhancement layers are of no use to the receiver. Therefore, such layered techniques are again inherently susceptible to route failures. These arguments reemphasize the need to develop multiple description (MD) source coding schemes. Note that the layered coding schemes form a special case of such an MD coding scheme, where performance guarantees are not given for individual layers, but the layers refine the coarser description of the source. An important application for future wireless networks could be real-time video. There has been significant research into robust video coding in the presence of packet errors (Reibman and Sun, 2000). The main problem that arises in video is that the compression schemes typically have motion compensation, which introduces memory into the coded stream. Therefore, decoding the current video frame requires the availability of previous video frames.
If previous frames are corrupted or lost, the decoder must use methods to conceal such errors. This is an active research topic, especially in the context of wireless channels (Girod and Farber, 2000). However, an appealing approach to this problem might be through route diversity and MD coding, and this is briefly discussed in section 9.5.1.

9.5 Route Diversity

9.5.1 Multiple Description (MD) Source Coding
In order to formalize the requirement of the MD source coder, we study the setup shown in fig. 9.13. As mentioned earlier, we will illustrate the ideas using only the two-description MD problem. Given a source sequence {X(k)}, we want to design an encoder that sends two descriptions at rates R1 and R2 over the two routes such that we get guaranteed approximations of the source when either route fails, or when both succeed. In section 9.5.2 we develop techniques that achieve such an objective. In order to understand the fundamental bounds on the performance of such techniques, we need to examine the problem from an information-theoretic point of view. The main tool to do this is given in rate-distortion theory (Cover and Thomas, 1991). This theory describes the fundamental limits of the trade-off between the rate of the representation of a source and the quality of the approximation. Not surprisingly, the origins of this theory are in Shannon (1948, 1958b). In order to give some of the basic ideas, we first make a short digression on the rudiments of this theory. Given a source sequence $X^T = \{X(1), \ldots, X(T)\}$ from a given alphabet $\mathcal{X}$, the source encoder needs to describe it using R bits per source sample (i.e., with a total of RT bits for the sequence). Equivalently, we map the source to the index set $\mathcal{J} = \{1, \ldots, 2^{RT}\}$. The goal is that, given this description, a decoder is able to approximately reconstruct the source sequence by the sequence $\hat{X}^T = \{\hat{X}(1), \ldots, \hat{X}(T)\}$. This is accomplished by constructing a function $f : \mathcal{J} \to \hat{\mathcal{X}}^T$, where $\hat{\mathcal{X}}$ is the alphabet over which the reconstruction is done. Common examples for the alphabets are $\mathcal{X} = \mathbb{R} = \hat{\mathcal{X}}$, or the binary field. The distortion measure $\tilde{d}(X^T, \hat{X}^T)$ quantifies the quality of the approximation between the reconstructed and original source sequences. Typically, the distortion measure is a single-letter function constructed as

$$\tilde{d}(X^T, \hat{X}^T) = \frac{1}{T} \sum_{i=1}^{T} d(X(i), \hat{X}(i)), \qquad (9.30)$$

where $d(X, \hat{X})$ denotes the quality of the approximation for each sample. Common examples are $d(X, \hat{X}) = |X - \hat{X}|^2$ and the Hamming distance (Cover and Thomas, 1991). The simplest framework in which to give performance bounds is to analyze the performance of a source encoder for an independent and identically distributed (i.i.d.) random source sequence. Typically, the interest is in the average distortion over the set of input sequences, for the given probability distribution associated with the source sequence. Therefore, the average distortion is $E[\tilde{d}(X^T, \hat{X}^T)]$, and the problem becomes one of quantifying the smallest rate R that can be used to describe the source with average fidelity D, asymptotically in the block length T. This is called the rate-distortion function R(D) and can be given an operational meaning by proving that there exist source codes that can achieve this fundamental bound (Cover and

Diversity in Communication: From Source Coding to Wireless Networks

Thomas, 1991). The central result in single-source rate-distortion theory is that R(D) is characterized as

$$R(D) = \min_{p(\hat{x}|x)\,:\,E[d(X,\hat{X})] \le D} I(X; \hat{X}), \qquad (9.31)$$

where, as before, $I(X; \hat{X})$ represents the mutual information between X and $\hat{X}$ (Cover and Thomas, 1991). A simple instantiation of this result is the special case where we want D = 0, i.e., the lossless case. In this case, one can see that R(0) = H(X), where H(X) is the entropy of the source. Another important special case is when the source sequence comes from a Gaussian distribution, $X \sim \mathcal{N}(0, \sigma_x^2)$, and we are interested in the squared error distortion metric, i.e., $d(X, \hat{X}) = |X - \hat{X}|^2$. In this case, equation 9.31 evaluates to $R(D) = \frac{1}{2} \log \frac{\sigma_x^2}{D}$ for $D \le \sigma_x^2$ and zero otherwise. Another way of writing this is in terms of the distortion-rate function D(R), which characterizes the smallest distortion achievable for a given rate. In the Gaussian case we see that $D(R) = \sigma_x^2 2^{-2R}$. We will interchangeably consider these two quantities. The result in equation 9.31 guarantees only that the average distortion does not exceed D. However, under some regularity conditions, the rate-distortion function remains the same even when we require the probability of the distortion $\tilde{d}(X^T, \hat{X}^T)$ exceeding D to go to zero (Berger, 1977; Cover and Thomas, 1991). The characterization of the rate-distortion function given in equation 9.31 has also been extended in many other ways, including to sources with memory (Cover and Thomas, 1991). Armed with this background, we can now formulate the question of the fundamental rate-distortion bounds on multiple description (MD) source coding. The multiple description source encoder needs to produce two descriptions of the source using R1 and R2 bits per source sample, respectively. We can formally describe the problem by requiring that the reconstructions $\{\hat{X}_1(k)\}, \{\hat{X}_2(k)\}, \{\hat{X}_{12}(k)\}$ use these descriptions to approximately reconstruct the source (see fig. 9.13). As in the "single description" case, we accomplish this by constructing functions

$$f_1 : \mathcal{J}_1 \to \hat{\mathcal{X}}^T, \quad f_2 : \mathcal{J}_2 \to \hat{\mathcal{X}}^T, \quad f_{12} : \mathcal{J}_1 \times \mathcal{J}_2 \to \hat{\mathcal{X}}^T, \qquad (9.32)$$

where $\mathcal{J}_i = \{1, \ldots, 2^{R_i T}\}$, i = 1, 2, and $\hat{\mathcal{X}}$ is the alphabet over which the reconstruction is done. We want the approximations to give average fidelity guarantees of

$$E[\tilde{d}(X^T, \hat{X}_1^T)] \le D_1, \quad E[\tilde{d}(X^T, \hat{X}_2^T)] \le D_2, \quad E[\tilde{d}(X^T, \hat{X}_{12}^T)] \le D_{12}. \qquad (9.33)$$
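The single-description functions R(D) and D(R) quoted above for the Gaussian case are easy to evaluate numerically; the following is a minimal sketch (the function names are ours, and rates are in bits, so that $D(R) = \sigma_x^2 2^{-2R}$):

```python
import math

def rate_distortion_gaussian(D, sigma2):
    """R(D) = (1/2) log2(sigma2/D) bits per sample for D <= sigma2, else 0."""
    return 0.5 * math.log2(sigma2 / D) if D < sigma2 else 0.0

def distortion_rate_gaussian(R, sigma2):
    """D(R) = sigma2 * 2**(-2R): smallest mean squared error at R bits per sample."""
    return sigma2 * 2.0 ** (-2.0 * R)

sigma2 = 1.0
D = distortion_rate_gaussian(2.0, sigma2)          # 2 bits/sample -> D = 1/16
assert D == 0.0625
assert abs(rate_distortion_gaussian(D, sigma2) - 2.0) < 1e-12  # round trip
assert rate_distortion_gaussian(2.0 * sigma2, sigma2) == 0.0   # D >= sigma2: zero rate
```

The round trip between the two functions reflects that R(D) and D(R) are inverses of each other on $0 < D \le \sigma_x^2$.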

The rate-distortion question in this context is to characterize the bounds on the tuple (R1, D1, R2, D2, D12); that is, we are interested in characterizing the achievable rate-distortion region described by this tuple. As can be seen, this seems a much more difficult question than the single-description problem, for which there is a complete characterization. As a matter of fact, the complete characterization of the MD rate region is still an open question. This problem was formalized in 1979, and in El Gamal and Cover (1982), a


Figure 9.13 Multiple description (MD) source coding. The source {X(k)} is encoded into description 1 at rate R1 and description 2 at rate R2, sent over routes 1 and 2. Decoder 1 produces {X̂1(k)}, decoder 2 produces {X̂2(k)}, and the joint decoder produces {X̂12(k)}.

theorem was proved which demonstrated a region of the tuple (R1, D1, R2, D2, D12) for which MD source codes exist.

Theorem 9.12 (El Gamal and Cover, 1982) Let X(1), X(2), . . . be a sequence of i.i.d. finite-alphabet random variables drawn according to a probability mass function p(x). If the distortion measures are $d_m(x, \hat{x}_m)$, m = 1, 2, 12, then an achievable rate region for tuples (R1, R2, D1, D2, D12) is given by the convex hull of the following:

$$R_1 \ge I(X; \hat{X}_1), \quad R_2 \ge I(X; \hat{X}_2), \quad R_1 + R_2 \ge I(X; \hat{X}_{12}, \hat{X}_1, \hat{X}_2) + I(\hat{X}_1; \hat{X}_2) \qquad (9.34)$$

for some probability mass function $p(x, \hat{x}_1, \hat{x}_2, \hat{x}_{12})$ such that $E[d_t(X, \hat{X}_t)] \le D_t$, t = 1, 2, 12.

This region was further improved in Zhang and Berger (1987) to a larger region for which MD source codes exist. However, what is unknown is whether these characterizations completely exhaust the set of tuples that can be achieved, i.e., a converse for the MD rate-distortion region. There are some special cases for which there are further results (Ahlswede, 1985; Fu and Yeung, 2002, and references therein). There has also been recent work on achievable rate regions for more than two descriptions (Pradhan et al., 2004; Venkataramani et al., 2003). However, in these cases as well the complete characterization is unknown. The only case for which the MD region is completely characterized is that of memoryless Gaussian sources with squared error distortion measures, and specifically for two descriptions.12 In Ozarow (1980), it was shown that the two-description MD region given in El Gamal and Cover (1982) was also applicable to the Gaussian case with squared error distortion, where the alphabet is not finite. Moreover, it was shown that the region in theorem 9.12 was in fact the complete characterization, by proving a converse (outer bound) to the rate region. In this context the source was modeled as a sequence of i.i.d. Gaussian random variables $X \sim \mathcal{N}(0, \sigma_x^2)$ and the squared error distortion measure was chosen, i.e., $d_m(x, \hat{x}_m) = |x - \hat{x}_m|^2$, m = 1, 2, 12. Therefore, specializing the result in theorem 9.12 to the Gaussian case yields the following complete characterization of the set


of all achievable tuples (R1, R2, D1, D2, D12) (El Gamal and Cover, 1982; Ozarow, 1980):

$$D_1 \ge \sigma_x^2 e^{-2R_1}, \quad D_2 \ge \sigma_x^2 e^{-2R_2},$$

$$D_{12} \ge \frac{\sigma_x^2 e^{-2(R_1+R_2)}}{1 - \left(\sqrt{\left(1 - \frac{D_1}{\sigma_x^2}\right)\left(1 - \frac{D_2}{\sigma_x^2}\right)} - \sqrt{\frac{D_1 D_2}{\sigma_x^4} - e^{-2(R_1+R_2)}}\right)^2}. \qquad (9.35)$$

In order to interpret this result, consider the following. As seen before, for the single-description Gaussian problem, the minimum distortion for a given rate is $D(R) = \sigma_x^2 2^{-2R}$. Therefore, the distortions D1, D2 clearly need to be governed by the single-description bound, and this explains the first two inequalities in equation 9.35. However, in the MD problem we also need to bound the distortion D12 when both descriptions are available. From the single-description bound it is clear that we would have $D_{12} \ge D(R_1 + R_2) = \sigma_x^2 2^{-2(R_1+R_2)}$. Therefore, a natural question is whether this bound on D12 can be achieved with equality. However, the result in theorem 9.12 shows that this is not possible unless $D_1 = \sigma_x^2$ or $D_2 = \sigma_x^2$. Here is where the tension between the two descriptions manifests itself. We examine the tension in the symmetric case, when we have D1 = D2 = D, R1 = R2 = R, and specialize to the unit-variance source $\sigma_x^2 = 1$. If we want the individual descriptions to be as efficient as possible (i.e., $D = e^{-2R}$), then we see that $D_{12} \ge \frac{D}{2-D}$, which is far larger than $D(R_1 + R_2) = e^{-2(R_1+R_2)} = D^2$. For small D, we see that D12 is approximately $\frac{D}{2}$, which is much larger than $D^2$. Therefore, if we ask that the individual descriptions be close to optimal themselves, then they do not mutually refine each other very well. This reveals the tension between making the distortions D1, D2 of the individual descriptions small and making D12 small. We need to make the individual descriptions coarser in order to get more mutual refinement in D12. One important real-time application is video coding. A video can be viewed as a sequence of individual frames which are correlated with each other. The traditional way of encoding video is to describe the "current" frame differentially with respect to the previous frame.
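The symmetric-case numbers above can be checked directly from the bound of equation 9.35. The sketch below (the function name is ours; rates in nats, $\sigma_x^2 = 1$) verifies that individually optimal descriptions force $D_{12} \ge D/(2-D)$, far above the joint optimum $D^2$:

```python
import math

def md_gaussian_D12_bound(R1, R2, D1, D2, sigma2=1.0):
    """Lower bound on D12 from eq. 9.35 (Ozarow); rates in nats."""
    Pi = (1.0 - D1 / sigma2) * (1.0 - D2 / sigma2)
    Delta = D1 * D2 / sigma2**2 - math.exp(-2.0 * (R1 + R2))
    denom = 1.0 - (math.sqrt(Pi) - math.sqrt(max(Delta, 0.0))) ** 2
    return sigma2 * math.exp(-2.0 * (R1 + R2)) / denom

R = 2.0
D = math.exp(-2.0 * R)                  # individually optimal descriptions
bound = md_gaussian_D12_bound(R, R, D, D)
assert abs(bound - D / (2.0 - D)) < 1e-8   # the bound collapses to D/(2-D) here
assert bound > D * D                        # far above the joint optimum D(2R) = D**2
```

With $D = e^{-2R}$ the term under the second square root vanishes, which is why the bound simplifies so cleanly in this case.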
This is done through a block-matching technique, where the "closest" (in terms of squared distance) blocks from the previous frame are matched to blocks in the current frame, and then only the differences are transmitted. The rationale behind this idea is that blocks are only relatively displaced due to motion of objects in the video, and hence this mechanism is called motion compensation in the literature (Reibman and Sun, 2000). Note that in this scheme, the encoder explicitly uses the knowledge of the previous frame. Clearly, when there are packet/route errors and the previous frame is not received at the destination, the reconstruction is difficult since the previous reference frame is not available. Therefore, several fixes to this problem have been developed over the past two decades (see Girod and Farber, 2000, and references therein). In a more abstract framework, we can think of the video as a sequence of correlated random variables which we are trying to describe efficiently. In Witsenhausen and Wyner an alternate approach was taken by considering the video


coding problem as a source coding problem with side information. In this setting, after the "previous" frame has been encoded and transmitted, the "current" frame is encoded without explicit dependence on knowledge of the previous frame. The basic idea of this scheme arises from the encoding and decoding schemes described in Slepian and Wolf (1973) and Wyner and Ziv (1976). Since the encoder does not explicitly use the side information (the previous frame), it can be designed such that the computational complexity is shifted from the encoder to the decoder. Such an architecture is attractive for applications where the encoder needs to be simple but the decoder can be more complex. This idea has been developed comprehensively in Puri and Ramchandran (2003), where practical coding techniques are developed with such applications in mind.

Figure 9.14 Multiple description source coding with side information. The source {X(k)} is encoded into indices i(1) ∈ I(1) and i(2) ∈ I(2), sent over routes 1 and 2. The decoders have access to the side information {S(k)}, and a switch determines whether the encoder also sees it; the reconstructions are X̂1(k) = f1(s, i(1)), X̂2(k) = f2(s, i(2)), and X̂12(k) = f12(s, i(1), i(2)).

However, even with this idea, the robustness to route failures that is inherent in MD coding is not captured. Motivated by this, Diggavi and Vaishampayan (2004) considered the MD problem with side information (see fig. 9.14). In this abstract setting, we want to encode a source {X(k)} when the decoder has knowledge of a correlated process {S(k)} as side information. For example, in the setting of Witsenhausen and Wyner and of Puri and Ramchandran (2003), the side information could be the previous frame. In order to describe the source in the presence of route diversity, we can pose an MD problem, but now with side information, as shown in fig. 9.14. Clearly this is a generalization of the MD problem, and an achievable rate region was established for this problem in Diggavi and Vaishampayan (2004).

Theorem 9.13 Let (X(1), S(1)), (X(2), S(2)), . . . be drawn i.i.d. ∼ Q(x, s). If only the decoder has access to the side information {S(k)}, then (R1, R2, D1, D2, D12) is achievable if there exist random variables (W1, W2, W12) with probability mass function $p(x, s, w_1, w_2, w_{12}) = Q(x, s)\, p(w_1, w_2, w_{12} | x)$, that is, $S \leftrightarrow X \leftrightarrow (W_1, W_2, W_{12})$


form a Markov chain, such that

$$R_1 > I(X; W_1 | S), \quad R_2 > I(X; W_2 | S), \quad R_1 + R_2 > I(X; W_{12}, W_1, W_2 | S) + I(W_1; W_2 | S), \qquad (9.36)$$

and there exist reconstruction functions f1, f2, f12 which satisfy

$$D_1 \ge E[d_1(X, f_1(S, W_1))], \quad D_2 \ge E[d_2(X, f_2(S, W_2))], \quad D_{12} \ge E[d_{12}(X, f_{12}(S, W_{12}, W_1, W_2))]. \qquad (9.37)$$

This result gives an achievable rate region, but the complete characterization for this problem is open. A slightly improved region relative to theorem 9.13 is also found in Diggavi and Vaishampayan (2004). However, it is unknown whether this region exhausts the achievable rate region. But for the case when both the source and the side information are jointly Gaussian, and we are interested in the squared error distortion, a complete characterization of the rate-distortion region was obtained in Diggavi and Vaishampayan (2004). In more detail, the result was the following. Let (X(1), S(1)), (X(2), S(2)), . . . be a sequence of i.i.d. jointly Gaussian random variables. With no loss of generality this can be represented by

$$S(k) = \alpha \left[ X(k) + U(k) \right], \qquad (9.38)$$

where α > 0 and {X(k)}, {U(k)} are independent Gaussian random variables with $E[X] = 0 = E[U]$, $E[X^2] = \sigma_X^2$, $E[U^2] = \sigma_U^2$. As considered in theorem 9.13, only the decoder has access to the side information {S(k)}. If the distortion measures are $d_m(x, \hat{x}_m) = \|x - \hat{x}_m\|^2$, m = 1, 2, 12, then it is shown in Diggavi and Vaishampayan (2004) that the set of all achievable tuples (R1, R2, D1, D2, D12) is given by

$$D_1 > \sigma_F^2 e^{-2R_1}, \quad D_2 > \sigma_F^2 e^{-2R_2}, \quad D_{12} > \frac{\sigma_F^2 e^{-2(R_1+R_2)}}{1 - \left(\sqrt{\tilde{\Pi}} - \sqrt{\tilde{\Delta}}\right)^2}, \qquad (9.39)$$

where $\sigma_F^2 = \frac{\sigma_X^2 \sigma_U^2}{\sigma_X^2 + \sigma_U^2}$ and $\tilde{\Pi}, \tilde{\Delta}$ are given by

$$\tilde{\Pi} = \left(1 - \frac{D_1}{\sigma_F^2}\right)\left(1 - \frac{D_2}{\sigma_F^2}\right), \quad \tilde{\Delta} = \frac{D_1 D_2}{\sigma_F^4} - e^{-2(R_1+R_2)}. \qquad (9.40)$$
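Equations 9.39 and 9.40 say that the side-information region is the Gaussian MD region with $\sigma_x^2$ replaced by the effective variance $\sigma_F^2$. A small numerical sketch (the helper names are ours; rates in nats):

```python
import math

def sigma_F2(sigma_X2, sigma_U2):
    """Effective variance of eq. 9.39: residual uncertainty in X given S."""
    return sigma_X2 * sigma_U2 / (sigma_X2 + sigma_U2)

def d12_bound_side_info(R1, R2, D1, D2, sF2):
    """Lower bound on D12 from eqs. 9.39-9.40; rates in nats."""
    Pi = (1.0 - D1 / sF2) * (1.0 - D2 / sF2)
    Delta = D1 * D2 / sF2**2 - math.exp(-2.0 * (R1 + R2))
    denom = 1.0 - (math.sqrt(Pi) - math.sqrt(max(Delta, 0.0))) ** 2
    return sF2 * math.exp(-2.0 * (R1 + R2)) / denom

sF2 = sigma_F2(1.0, 1.0)     # = 0.5: side information halves the effective variance
assert abs(sF2 - 0.5) < 1e-12
b = d12_bound_side_info(1.0, 1.0, 0.2, 0.2, sF2)
assert sF2 * math.exp(-4.0) <= b < 0.2   # refined, but never below the joint bound
```

The final assertion reflects that the denominator in equation 9.39 is at most one, so D12 can never beat $\sigma_F^2 e^{-2(R_1+R_2)}$.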

The result in equation 9.39 also shows that the rate-distortion region in this case is the same as that achieved when both encoder and decoder have access to the side information. That is, in the Gaussian case, the rates that can be achieved are the same whether the switch in fig. 9.14 is open or closed. In Wyner and Ziv (1976) it was shown that in the single-description Gaussian case, the decoder-only side information rate-distortion function coincided with that when both encoder and decoder were informed of the side information. The result (eq. 9.39) establishes that this is also true in the Gaussian two-description problem with decoder side information. However, the encoding and decoding techniques to achieve these rate tuples are very different when the encoder has access to the side information than


when it does not. This shows that there might be efficient mechanisms to construct MD video coders which are robust to route failures. Some of the code constructions that bring this idea to fruition are discussed in section 9.5.2.

9.5.2 Quantizers for Route Diversity

The results given in section 9.5.1 show the existence of codes that can achieve the rate tuples given in theorems 9.12 and 9.13, but there are no explicit constructions. In this section we explore explicit coding schemes which utilize the presence of route diversity. As seen in section 9.5.1, the single-description rate-distortion function quantifies the fundamental limits of the trade-off between the rate of the representation of a source and its average fidelity. The result in equation 9.31 showed the existence of such codes. Explicit constructions of these codes are called quantizers (Gersho and Gray, 1992; Gray and Neuhoff, 1998). More formally, quantizers map a sequence {X(1), . . . , X(T)} of source samples into a "representative" reconstruction {X̂(1), . . . , X̂(T)} through an explicit mapping which is typically computationally efficient. Scalar quantizers operate on a single source sample X(k) at a time. Most current systems use scalar quantizers (Jayant and Noll, 1984). However, rate-distortion theory tells us that using sequences is important, and hence vector quantizers use sequences of source samples, i.e., T > 1, for quantization. Quantization techniques for a single description have been quite well studied and understood (Gersho and Gray, 1992; Gray and Neuhoff, 1998; Jayant and Noll, 1984). The rudiments of the MD coding ideas arose in the 1970s at Bell Laboratories. Jayant (1981) proposed and analyzed a very simple idea of channel splitting. The basic idea was to oversample a speech signal and send the odd samples through one channel and the even ones through another. However, this technique is not very efficient in terms of rate. Many such simple coding techniques were being considered at Bell Laboratories, but the ideas were not archived. These questions actually motivated the information-theoretic formulation of the MD problem described in section 9.5.1. The systematic study of coding for multiple descriptions was initiated in Vaishampayan (1993).
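Jayant's even/odd channel splitting is easy to sketch as a toy (this is an illustration of the idea, not the published speech codec): even-indexed samples go over one route, odd-indexed over the other, and a lost route is concealed by averaging neighbors.

```python
def split(samples):
    """Jayant-style channel splitting: even-indexed and odd-indexed samples."""
    return samples[0::2], samples[1::2]

def merge(even, odd, n):
    """Both routes succeed: re-interleave the two descriptions exactly."""
    out = [0.0] * n
    out[0::2], out[1::2] = even, odd
    return out

def conceal_odd(even, n):
    """Route carrying odd samples lost: average the neighboring even samples."""
    out = [0.0] * n
    out[0::2] = even
    for i in range(1, n, 2):
        out[i] = (out[i - 1] + out[i + 1]) / 2 if i + 1 < n else out[i - 1]
    return out

x = [0.0, 1.0, 2.0, 3.0, 4.0]
even, odd = split(x)
assert merge(even, odd, len(x)) == x     # no loss: perfect reconstruction
assert conceal_odd(even, len(x)) == x    # exact here only because x is linear
```

For an oversampled (smooth) signal the interpolation error is small, which is exactly why the scheme gives graceful degradation; its rate inefficiency comes from sending correlated samples without exploiting the correlation.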
Its publication resulted in a spurt of activity on the topic (see, for example, Diggavi et al., 2002c; Goyal and Kovacevic, 2001; Ingle and Vaishampayan, 1995; Vaishampayan et al., 2001, and references therein). More recently, the utility of MD coding in conjunction with route diversity has also created interest in the networking community (see Apostolopoulos and Trott, 2004, and references therein). The basic idea introduced in Vaishampayan (1993) was to construct scalar quantizers for the MD problem. This was done specifically for the symmetric case, where D1 = D2 and R1 = R2. This symmetric construction was extended to structured (lattice) vector quantizers in Vaishampayan et al. (2001). The symmetric case has been further explored by several other researchers (Goyal and Kovacevic, 2001; Ingle and Vaishampayan, 1995). The importance of structured quantizers lies in the computational complexity of the source encoder. For example, just as in channel


Figure 9.15 Structure of multiple description quantizer. The source is first quantized to a point λ of a fine lattice Λ; a multiple description labeling then maps λ to a pair (λ1, λ2) in the sublattices Λ1 and Λ2, whose intersection (least common multiple) is the lattice Λs.

coding, trellis-based structures are also important in source coding. Such structures have also been proposed for the symmetric MD problem (Buzi, 1994; Jafarkhani and Tarokh, 1999). In general, unstructured quantizers based on training on some source samples can also be constructed, but the computational complexity of such techniques is much higher than that of structured (lattice) quantizers, and therefore they are less attractive in practice. Such unstructured quantizers have been considered in the literature (Fleming et al., 2004). Our focus in this chapter will be on structured quantizers, for which we have computationally efficient encoders as well as techniques to analyze their performance. In general we would like to design MD quantizers that can attain an arbitrary rate-distortion tuple, and not just the symmetric case. This is motivated by applications where the multiple routes have disparate capacities (and therefore rate requirements) as well as different probabilities of route failure. In these cases, we need to design asymmetric MD quantizers which give graceful degradation in performance with route failures. Such a structure was studied in Diggavi et al. (2002c), and is depicted in fig. 9.15. We illustrate the ideas of MD quantizer design from Diggavi et al. (2002c) using a scalar example. In fig. 9.16, the first line represents a uniform scalar quantizer. If we take a single source sample X(k) ∈ ℝ, then the uniform quantizer maps this sample to the closest "representative" point X̂ on the one-dimensional (scaled integer) lattice Λ. Loosely, a T-dimensional lattice is a set of regularly spaced points in ℝ^T for which any point can be chosen as the origin and the set of points would be the same. A more precise notion is based on the set of points forming an additive group (Conway and Sloane, 1999). Each of the representative points is given a unique label λ, and this label is transmitted to the receiver. The transmission rate depends on the number of labels.
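A uniform scalar quantizer of this kind is just rounding to the scaled-integer lattice; a minimal sketch (ours), which also checks empirically the familiar Δ²/12 average-distortion figure for a uniformly distributed source:

```python
import random

def quantize(x, delta):
    """Map x to the closest point of the scaled-integer lattice {k*delta}."""
    return delta * round(x / delta)

delta = 0.25
assert quantize(0.34, delta) == 0.25
assert quantize(-0.13, delta) == -0.25

# Empirical average squared error for a uniformly distributed source,
# which should come out close to delta**2 / 12.
random.seed(0)
n = 200_000
err = sum((x - quantize(x, delta)) ** 2
          for x in (random.uniform(-2.0, 2.0) for _ in range(n))) / n
assert abs(err - delta**2 / 12.0) < 1e-4
```

The Monte Carlo check works because within each quantization cell the error is uniform on [−Δ/2, Δ/2], whose second moment is Δ²/12.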
Typically a finite set of 2M points is used to represent the labels. In a straightforward manner, this translates to a rate of log(2M) bits per source sample. If the source either has finite extent or a finite second-order moment, such a quantizer would have a bounded squared error distortion. If the representative points are separated by a distance of Δ, then the worst-case squared error distortion between a source sample and its representative is $\frac{\Delta^2}{4}$ for source samples $X(k) \in [-\frac{(M+1)\Delta}{2}, \frac{(M+1)\Delta}{2}]$. For a uniform distribution of the source in the region $X(k) \in [-\frac{(M+1)\Delta}{2}, \frac{(M+1)\Delta}{2}]$, the average distortion is $\frac{\Delta^2}{12}$


Figure 9.16 Scalar quantizer labeling example. The top line is a uniform scalar quantizer that maps source points X(k) to a set of discrete representatives X̂. The second and third lines show coarser uniform scalar quantizers. The last line puts together the combination of the coarser quantizers to give an ordered pair (λ1, λ2) as a label to every lattice point λ in the fine quantizer.

(Gersho and Gray, 1992). The mapping described above is a single-description uniform scalar quantizer. The MD scalar quantizer needs to map every source sample to an ordered pair of representation points (X̂1, X̂2). The labels (λ1, λ2) of this pair are used to send information over the two routes. For example, we could send the label λ1 over the first route and the label λ2 over the second route. In fig. 9.16 we have illustrated this by choosing coarser scalar quantizers in the second and third lines for the representations X̂1 and X̂2 respectively. These quantizers are also one-dimensional lattices, Λ1 and Λ2 respectively. The representations X̂1, X̂2 in themselves give coarser information about the source sample, i.e., have a larger distortion than the "finer" quantizer Λ shown in the first line. Now, we need to represent the source sample X(k) by a pair of representation points from Λ1 and Λ2. We want to choose this pair in such a way that if either of the labels is lost due to route failure, then we are still guaranteed a certain distortion. However, if both labels are received, i.e., both routes are successful, then we need to get a smaller distortion. This means that the label pair have to mutually refine each other's representations. One such labeling technique is illustrated in fig. 9.16. Each point X̂1 in the coarser lattice Λ1 and X̂2 in Λ2 is given a label λ1 and λ2 respectively. The idea is then to give a pair of labels (λ1, λ2) to each of the points on the fine lattice Λ. Every lattice point in Λ gets a unique label pair (λ1, λ2). Once this labeling function is constructed, we can form the multiple description (MD) scalar quantizer by the following two steps. First, reduce the source sample X(k) ∈ ℝ to its


closest representative X̂ in Λ, with label λ; i.e., apply a uniform scalar quantizer to X(k). Given this X̂ and the labeling function, we know the pair (λ1, λ2) that represents X̂. The second step is to associate X̂1 with the reconstruction given by the label λ1 in the first coarse quantizer Λ1, and similarly for X̂2 in Λ2. These operations are what the structure in fig. 9.15 represents. Therefore, in this design, the main task is to construct the labeling function for each point in Λ. Given the label pair (λ1, λ2), the encoder sends the index associated with λ1 on route 1 and the index for λ2 on route 2. Before describing the labeling function, we examine the decoder structure in the MD scalar quantizer described above. First recall that the labeling function is designed so that any particular pair (λ1, λ2) is uniquely associated with a particular λ. Therefore, if both routes succeed, the receiver is able to reconstruct λ and, as a consequence, X̂. This means that the distortion in this case is that associated with the fine quantizer Λ, i.e., the average fidelity is Δ²/12. Now suppose route 1 succeeds and route 2 fails; then the receiver has only λ1 and does not know λ2. For example, suppose in fig. 9.16 the label pair (−1, 0) was chosen at the source encoder, i.e., λ1 = −1, λ2 = 0. The receiver then knows that the encoder was trying to send one of the two points (−1, 0) or (−1, −1), and since route 2 failed it does not know which. More generally, in this situation, the receiver knows that λ belongs to the set of points in Λ which have the same first label λ1 but a different second label λ2. Now, assume that the decoder uses the reconstruction X̂1 associated with the label λ1 = −1 in Λ1 (see the second line in fig. 9.16). For this particular example, the worst-case error due to this choice is (9/4)Δ². This example also shows that the labeling function directly affects the decoder distortion. The design of the labeling function is the central part of the MD quantizer. The reconstruction X̂1 can use the mean of the set of all points in Λ associated with the same first label λ1, which may improve the distortion. Note that in general this might not coincide with the reconstruction associated with λ1. For design simplicity, this reconstruction need not be taken into account in designing the labeling function, but rather can be used only at the decoder to improve the final distortion. In general, we would need to construct a labeling function for all the points in Λ. However, we describe a particular design which solves a smaller problem and then expands its solution to Λ (Diggavi et al., 2002c; Vaishampayan et al., 2001). We will illustrate this idea using the example shown in fig. 9.16. In the last line of fig. 9.16, we have depicted the overlay of the two coarse one-dimensional lattices Λ1, Λ2 along with Λ. We see that there is a repetitive pattern after every six points in Λ. This is not a coincidence, because Λ1 was formed by taking every second point in Λ and Λ2 by taking every third. The least common multiple is 6, and therefore we would expect the pattern to repeat. The basic idea is to form a labeling function for just these six points and then "shift" these labels to tile the entire lattice Λ. For example, in fig. 9.16, consider the point which we have labeled (2, 2) on the last line. This was done in the following manner. Notice that the repeating pattern of six points can be anchored by the points where the Λ1 and Λ2 points coincide. In fig. 9.16, these are the points which have overlapped


circles on the last line. We can think of all points in Λ with respect to these anchor points. For example, the point labeled (2, 2) is one point to the left of such an overlap point and is "equivalent" to the point labeled (−1, 0). More precisely, it is in the same coset as the other point with respect to the "intersection" lattice Λs, which is formed by the anchor points. Therefore, we get the label by shifting the label of (−1, 0) with respect to its coset. In this case, note that λ1 = −1 in Λ1 is two points to the left of the anchor point (0, 0). Therefore, the corresponding point with respect to the anchor point (3, 2) is λ1 = 2, and hence the first label for the point of interest is λ1 = 2. Next, the corresponding point of the label λ2 = 0 in Λ2 with respect to the anchor point (3, 2) is λ2 = 2. This gives us the label (2, 2) shown in fig. 9.16. In a similar manner, given the labeling for the six points, we can construct the labeling for all points in Λ by the shifting technique described above. Actually, the six points correspond to the discrete Voronoi region of the point (0, 0) of the intersection lattice of the anchor points. Therefore, we can focus on constructing labels for the points in the Voronoi region of the intersection lattice. Note that in the example of fig. 9.16, the intersection lattice has an index of six, which is exactly the least common multiple of the indices of the lattices Λ1, Λ2 in Λ. This is also true when the indices of Λ1, Λ2 in Λ are not coprime (Diggavi et al., 2002c). Let $V_{\Lambda_s:\Lambda}(0)$ be defined as the Voronoi region of the intersection lattice. Our problem is to develop the labeling function for the points in $V_{\Lambda_s:\Lambda}(0)$ in order to satisfy the individual distortion constraints D1, D2. This is accomplished by using a Lagrangian formulation in Diggavi et al. (2002c). This formulation reduces to finding the labeling scheme $\alpha(\lambda) = (\alpha_1(\lambda), \alpha_2(\lambda))$ so as to minimize

$$\sum_{\lambda \in V_{\Lambda_s:\Lambda}(0)} \left[ \gamma_1 \|\lambda - \alpha_1(\lambda)\|^2 + \gamma_2 \|\lambda - \alpha_2(\lambda)\|^2 \right]. \qquad (9.41)$$

For this minimization problem we need to choose the appropriate labels $(\alpha_1(\lambda), \alpha_2(\lambda)) = (\lambda_1, \lambda_2)$. This is done by observing the following identity:

$$\gamma_1 \|\lambda - \lambda_1\|^2 + \gamma_2 \|\lambda - \lambda_2\|^2 = \frac{\gamma_1 \gamma_2}{\gamma_1 + \gamma_2} \|\lambda_2 - \lambda_1\|^2 + (\gamma_1 + \gamma_2) \left\| \lambda - \frac{\gamma_1 \lambda_1 + \gamma_2 \lambda_2}{\gamma_1 + \gamma_2} \right\|^2.$$

This results in the following design guideline. The labeling problem is split into two parts: (1) choose $|V_{\Lambda_s:\Lambda}(0)|$ "shortest" pairs (λ1, λ2) (not all pairs (λ1, λ2) are used); (2) assign these pairs to lattice points $\lambda \in V_{\Lambda_s:\Lambda}(0)$. The second part can be solved very efficiently using linear programming methods. The solution of this labeling problem illustrates an important feature of the MD quantizer design that is quite distinct from the single-description case. It can happen that particular labels of each description are noncontiguous, i.e., not all points λ which get the same label (say, λ1) need to occur contiguously. This is quite different from the single-description case, where the labels are assigned to contiguous intervals. Also, the labels generated in this systematic manner are nontrivial and difficult to handcraft.
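For the integer-lattice example of fig. 9.16 (Λ = Z, Λ1 = 2Z, Λ2 = 3Z, intersection lattice 6Z), the two-step guideline can be imitated by a small brute-force sketch. A greedy assignment stands in here for the linear program of Diggavi et al. (2002c), and the sublattice windows are arbitrary choices of ours:

```python
from itertools import product

L1 = range(-6, 13, 2)         # a window of the sublattice 2Z
L2 = range(-6, 13, 3)         # a window of the sublattice 3Z
voronoi = [0, 1, 2, 3, 4, 5]  # Voronoi cell of the intersection lattice 6Z

def pair_cost(lam, pair, g1=1.0, g2=1.0):
    """Per-point term of eq. 9.41 with equal Lagrange weights."""
    l1, l2 = pair
    return g1 * (lam - l1) ** 2 + g2 * (lam - l2) ** 2

# Greedy stand-in for the two-step design: each fine-lattice point takes the
# cheapest still-unused ("short") pair, keeping the label map one-to-one.
labels, used = {}, set()
for lam in voronoi:
    best = min((p for p in product(L1, L2) if p not in used),
               key=lambda p: pair_cost(lam, p))
    labels[lam] = best
    used.add(best)

assert len(set(labels.values())) == len(voronoi)   # joint decoder can invert
# Losing route 2 leaves an ambiguity set per first label, e.g. for l1 = 2:
assert {lam for lam, (l1, _) in labels.items() if l1 == 2} == {1, 2}
```

The final assertion shows the noncontiguity phenomenon in miniature: each coarse label covers a small cluster of fine-lattice points, and the pair of coarse labels pins the point down exactly.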


Diversity in Communication: From Source Coding to Wireless Networks

Figure 9.17 Labels for a two-dimensional integer lattice example.

The labeling scheme described for the scalar quantizer actually illustrates a more general principle which is applicable to MD vector quantizers (Diggavi et al., 2002c). We use a chain of lattices as illustrated in fig. 9.15, i.e., we use a fine lattice Λ and two coarser sublattices Λ1, Λ2. These lattices have an intersection lattice Λlcm, one of whose Voronoi regions is what we label. The idea of using sublattice shifts, as done above, to generate the labels using only the labels of this Voronoi region can also be generalized (Diggavi et al., 2002c). One such example of the labels of the Voronoi region for a two-dimensional lattice is shown in fig. 9.17. Therefore, the vector quantizer proceeds as follows. We first reduce the point X^T ∈ ℝ^T using the fine lattice Λ, and then using the labeling function we find (λ1, λ2). Then, as before, λ1 is sent over the first route and λ2 is sent over the second route. The decoder also proceeds in a manner similar to the scalar quantizer described above. As seen above, the crux of the MD quantizer design problem is to construct the appropriate labeling function. In Diggavi et al. (2002c), it is shown that an appropriate labeling function, along the lines described for the scalar quantizer, can be constructed very efficiently using a linear program. In fact, Diggavi et al. (2002c) shows that such a labeling scheme is very close to optimal, in terms of the rate distortion result given in theorem 9.12, in the high-rate regime.

9.5.3 Network Protocols for Route Diversity

In order to utilize route diversity in a network, one of the most important components is clearly the design of the MD source coding techniques studied in section 9.5.2. However, an equally important question is the design of routing techniques that can enable the use of MD source coding. In this section we briefly examine these issues from a networking point of view. In order to create route diversity, we need to have multiple routes which are disjoint, in that they do not share common links. This can be done through IP source routing (Keshav, 1997). Source routing is a technique whereby the sender of a packet can specify the route that the packet should take through the network. In the typical IP routing protocol, each router chooses the next hop for a packet by examining the destination IP address. In source routing, however, the “source” (i.e., the sender) makes some or all of these decisions. In strict source routing (which is virtually never used), the sender specifies the exact route the packet must take. The more common form is loose source record route (LSRR), in which the sender gives one or more hops that the packet must go through. Therefore, the sender can take an MD code and send each of the descriptions over a different route by explicitly specifying the routes using IP source routing. An alternate technique is to use an overlay network, where an application collects the different descriptions and sends them through different relay nodes in order to create route diversity. This discussion shows that creating route diversity is architecturally not difficult, even using the provisions within IP (Keshav, 1997). This discussion from a networking point of view also exposes the inherent interactions required between the routing and application layers of the networking protocol stack.
Such “interlayer” interactions become particularly important in wireless networks, where route failures could occur more frequently than in wired networks. Therefore, in this case diversity, albeit at a much higher layer in the IP stack, again becomes quite important.
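The reliability benefit of sending the two descriptions over disjoint routes is easy to quantify: all information is lost only when both routes fail. A minimal simulation sketch (the per-route loss probability is an assumed illustrative value, not from the chapter):

```python
import random

rng = random.Random(42)
p = 0.2          # assumed per-route packet-loss probability
trials = 20000

# single route: the data is lost whenever that route fails
single_fail = sum(rng.random() < p for _ in range(trials)) / trials

# two disjoint routes carrying one description each: everything is lost
# only if both routes fail (probability roughly p^2)
both_fail = sum((rng.random() < p) and (rng.random() < p)
                for _ in range(trials)) / trials

assert both_fail < single_fail
```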

9.6 Discussion

In this chapter we studied the emerging role of diversity with respect to three disparate topics. The idea of using multiple instantiations of randomness attempts to turn the presence of randomness to an advantage. For example, in multiple-antenna diversity, the degrees of freedom provided by space diversity are utilized for increased rate or reliability. In mobile ad hoc networks, the random mobility is utilized to route information from source to destination. To realize the benefits promised by the use of diversity, we need to have interactions across networking layers. For example, in opportunistic scheduling (studied in section 9.4.1) the transmission rates that can be supported by the physical layer interact with the resource allocation (scheduling), which is normally only a functionality of the data-link layer. In the multi-user diversity studied in

mobile ad hoc networks (see section 9.4.2), the routing of the packets interacted with the physical-layer transmission. Finally, the MD source coding studied in section 9.5 necessitated an interaction between source coding (an application-layer functionality) and routing. These examples of cross-layer protocols are becoming increasingly important in reliable network communication. Diversity is the common thread among several of these cross-layer protocols. The advantages of using diversity in these contexts are just beginning to be realized in practice. There might be many more areas where the ideas of using diversity could have an impact, and this is a topic of ongoing research.

Notes

1. The term sufficient statistic refers to a function (perhaps many-to-one) which does not cause loss of information about the random quantity of interest.
2. To be precise, we need to sample equation 9.1 at a rate larger than 2(W_I + W_s), where W_I is the input bandwidth and W_s is the bandwidth of the channel time variation (Kailath, 1961).
3. In passband communication, a complex signal arises due to in-phase and quadrature-phase modulation of the carrier signal; see Proakis (1995).
4. This can be seen by noticing that for M_t = 1, a sufficient statistic is an equivalent scalar channel, ỹ(b) = h(b)* y(b) = ‖h(b)‖² x(b) + h(b)* z(b). In this chapter, |h|² = h h̄, where h̄ denotes complex conjugation, and for a vector h we denote its 2-norm by ‖h‖² = h*h, where h* denotes the Hermitian transpose and h^t denotes ordinary transpose.
5. The assumption that {H(b)} is i.i.d. is not crucial. This result is (asymptotically) correct even when the sequence {H(b)} is a mean ergodic sequence (Ozarow et al., 1994). We use the notation H to denote the channel matrix H(b) for a generic block b.
6. For a matrix A, we denote its determinant as det(A) and |A|, interchangeably.
7. In Foschini (1996), a similar expression was derived without illustrating the converse to establish that the expression was indeed the capacity.
8. Here the notation o(1) indicates a term that goes to zero as SNR → ∞.
9. For an information rate of R bits per transmission and a block length of T, we define the codebook as the set of 2^{TR} codeword sequences of length T.
10. A constellation size refers to the alphabet size of each transmitted symbol. For example, a QPSK-modulated transmission has a constellation size of 4.
11. We use the notation f(n) = Θ(g(n)) to denote f(n) = O(g(n)) as well as g(n) = O(f(n)). Here f(n) = O(g(n)) means lim sup_{n→∞} |f(n)/g(n)| < ∞.
12. Interestingly, this result is specifically for two descriptions and does not immediately extend to the general case.

10 Designing Patterns for Easy Recognition: Information Transmission with Low-Density Parity-Check Codes

Frank R. Kschischang and Masoud Ardakani

10.1 Introduction

Coding for information transmission over a communication channel may be defined as the art of designing a (large) set of codewords such that (i) any codeword can be selected for transmission over the channel, and (ii) the corresponding channel output with very high probability identifies the transmitted codeword. Low-density parity-check codes represent the current state of the art in channel coding. They are a family of codes with flexible code parameters, and a code structure that can be fine-tuned so that decoding can occur at transmission rates approaching the information-theoretic limits established by Claude Shannon, yet with “practical” decoding complexity. In this chapter—which is aimed at the non-expert—we show that these codes are easy to describe using probabilistic graphical models, and that their simplest decoding algorithms (the “sum-product” or “belief-propagation” algorithm, and variations thereof) can be understood as message-passing in the graphical model. We show that a simple Gaussian approximation of the messages passed in the decoder leads to a tractable code-optimization problem, and that solving this optimization problem results in codes whose performance appears to approach the Shannon limit, at least for some channels. Communication channels are typically modeled (with no essential loss of generality) in discrete time. At each unit of time a channel accepts (from the transmitter) a “channel input symbol” and produces (for the receiver) a corresponding “channel output symbol” according to some probabilistic channel model. Information is usually transmitted by using the channel many times, i.e., by transmitting many channel input symbols. In so-called “block coding,” messages are mapped to sequences (x1, . . . , xn) of channel inputs of a fixed block-length n. A code is a set of “valid codewords” agreed upon by the transmitter and receiver prior to communication.
A code is typically a (carefully selected) subset of the set of all possible channel inputs of length n. Transmission of a codeword gives rise (at the receiver)

to a “received word,” an n-tuple (y1, . . . , yn) of channel output symbols. The task of the receiver is to infer, from the received word, which codeword—and hence which message—was (ideally, most likely) transmitted. Alternatively, if the message consists of many symbols, the receiver may wish to infer the most likely value of each message symbol. By attempting to determine which codeword was most likely transmitted, a decoding algorithm attempts to solve a noisy pattern recognition problem. There are, therefore, some similarities between the fields of coding theory and pattern recognition. However, a key difference that makes the two fields quite distinct is the fact that the set of valid codewords is under the control of the system designer in the former case, but not (usually) in the latter case. In other words, in coding theory the system designer is given the luxury of choosing the set of patterns to be recognized by the decoding algorithm. A major theme in coding theory research is, therefore, to optimize or fine-tune the structure of the code for effective recognition (decoding) by some particular class of decoding algorithms. Another major difference between coding theory and typical pattern-recognition problems is the sheer number of patterns to be recognized by the decoding algorithm. In many pattern-recognition problems, the number of different patterns (or pattern classes) is relatively small. In coding theory, the numbers can be extraordinarily large. The transmission of k bits corresponds to the selection, by the transmitter, of one codeword from a code of 2^k possible codewords. Thus, transmission of a single bit requires a code of just two codewords; transmission of two bits requires a code of four codewords, and so on. Typical values of k for codes used in practice range from less than a dozen bits to tens of thousands of bits, and hence the number of different codewords to be “recognized” can be as large as 2^10,000 or more!
Despite these huge numbers, decoding algorithms routinely make rapid decoding decisions, reliably producing decoded information at many megabits per second. A key parameter of a code is its rate. A code of 2^k codewords, each having block-length n, is said to have a rate of R = k/n bits/symbol (or bits per channel use). Clearly the rate of a code is a measure of the “speed” at which information is transmitted, normalized per channel use. To convert from bits per symbol to bits per second, one needs to know the number of symbols that may be transmitted per second, a value that typically scales linearly with the “channel bandwidth.” Channel bandwidths can vary greatly, depending on the application; thus, from the point of view of code design, it is more appropriate to focus on code rate measured in bits/symbol (rather than bits/s). Given a particular channel, one would clearly like to make the code rate as large as possible. On the other hand, one also desires to make reliable decoding decisions, i.e., decisions for which the probability of error approaches zero. At first glance, it may seem that there should be a smooth trade-off between transmission rate and reliability, i.e., for a fixed k, intuition would suggest that for some sufficiently large n it should be possible to design a code so that the probability of decoding error can be made smaller than any chosen ε > 0.

This would certainly be true for the transmission of k = 1 bit over a binary symmetric channel with “crossover probability” p < 1/2. Such a channel accepts xi ∈ {0, 1} at its input and produces a corresponding yi ∈ {0, 1} at its output. Each transmitted symbol is independently “flipped” with probability p, i.e., with probability p we have yi ≠ xi. A single bit can be transmitted with a repetition code of two codewords {000 · · · 0, 111 · · · 1}, where both codewords have the same length n. This code can be decoded according to a “majority rule”: if the majority of the symbols in the received word are zero, then decode to the all-zero codeword; otherwise decode to the all-one codeword. It is easy to see that if n is made sufficiently large, then the probability of error under the majority rule can be made arbitrarily small for any p < 1/2. In this example, there is a smooth trade-off between rate and reliability; as the rate decreases, the reliability increases, and one might be inclined to believe that such is the trade-off in general. In fact, although there is a fundamental trade-off between code rate and reliability, the trade-off is abrupt (like a step function), not smooth. In his seminal 1948 paper (Shannon, 1948), Claude E. Shannon established the remarkable fact that typical communication channels are characterized by a so-called channel capacity, C, with the property that reliable communication (i.e., communication with probability of error approaching zero) is possible for every R < C. More precisely, Shannon showed that for every R < C and every ε > 0, by choosing a sufficiently large block-length n, there exists a block code of length n with at least 2^{nR} codewords and a decoding algorithm for this code that yields a probability of decoding error smaller than ε. Conversely, one may also show that if R > C, then, even using an algorithm that minimizes error probability, it is impossible to achieve arbitrarily small probability of error.
Thus, to achieve arbitrarily good reliability, it is not necessary to have R → 0, but rather only to have R < C. Information theory allows us to reﬁne our stated goal of coding for information transmission over a communication channel with capacity C. The goal is to design a code such that (i) any codeword can be selected for transmission over the channel, (ii) the corresponding channel output can be processed by an algorithm of “practical” complexity to identify (with very high probability) the transmitted codeword, and (iii) the rate of the code is “close” to C. In the remainder of this chapter, we will show how low-density parity-check (LDPC) codes achieve this goal.
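The repetition-code example above can be simulated in a few lines; this sketch (the crossover probability and block-lengths are arbitrary choices) shows the majority rule's error probability falling as n grows:

```python
import random

def bsc(bits, p, rng):
    # flip each transmitted bit independently with probability p
    return [b ^ (rng.random() < p) for b in bits]

def majority_decode(word):
    # decode to all-zero (0) if most received symbols are 0, else all-one (1)
    return 1 if sum(word) > len(word) / 2 else 0

rng = random.Random(0)
p, trials = 0.2, 2000
error_rate = {}
for n in (1, 11, 101):
    errors = sum(majority_decode(bsc([0] * n, p, rng)) != 0
                 for _ in range(trials))
    error_rate[n] = errors / trials

# longer repetition codes are more reliable (at ever lower rate 1/n)
assert error_rate[101] <= error_rate[11] < error_rate[1]
```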

10.2 A Brief Introduction to Coding Theory

Low-density parity-check codes are binary linear block codes (though it is possible to define non-binary versions as well). Accordingly, we begin by defining what this means. Let F2 denote the finite field of two elements {0, 1}, closed under modulo-two integer addition and multiplication. This field has the simplest possible arithmetic: for all x ∈ F2, under addition we have 0 + x = x and 1 + x = x̄, where x̄ denotes the complement of x (thus a two-input F2-adder is an exclusive-OR gate), and under

multiplication we have 0 · x = 0 and 1 · x = x (thus a two-input F2-multiplier is an AND gate). For every positive integer n, we let F2^n denote the set of n-tuples with components from F2, which forms a vector space over F2 equipped with the usual component-wise vector addition and with multiplication by scalars from F2. As is the convention in coding theory, we will always think of such vectors as row vectors. By definition, a binary linear block code of block-length n and dimension k is a k-dimensional subspace of F2^n. It follows that a binary linear block code is itself closed under vector addition and multiplication by scalars; in particular, the sum of two codewords is another codeword, and the code always contains the all-zero vector (0, 0, . . . , 0). (Although it is certainly possible to define nonlinear codes, i.e., codes that are general subsets of F2^n, not necessarily subspaces, most codes used in practice are linear codes.) A binary linear code of length n and dimension k will be denoted as an [n, k] code. Such a code has 2^k codewords, and hence has rate R = k/n. We will only consider such codes with k > 0. From now on, when we write “code,” we mean binary linear block code. How can an [n, k] code be specified? One way is to observe that such a code C is a k-dimensional vector space, and hence it has a basis (in general, many bases), i.e., a set {v1, v2, . . . , vk} of linearly independent vectors that span C. These vectors can be collected together as the rows of a k × n matrix G, called a generator matrix for C, with the evident property that C is the row space of G. For every distinct u ∈ F2^k we obtain a distinct codeword v = uG ∈ C. Hence, a generator matrix yields a way to implement an encoder for C, by simply mapping a message u ∈ F2^k under matrix multiplication by G to the codeword v = uG ∈ C. Thus a code C may be specified by providing a generator matrix for C.
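Encoding with a generator matrix is just the matrix product v = uG with all arithmetic carried out modulo two. A quick sketch with a small hypothetical [4, 2] code (this G is illustrative, not taken from the chapter):

```python
def encode(u, G):
    # v = u G over F2: each codeword component is a mod-2 inner product
    n = len(G[0])
    return [sum(u[i] * G[i][j] for i in range(len(u))) % 2 for j in range(n)]

# hypothetical generator matrix of a [4, 2] binary linear code
G = [[1, 0, 1, 1],
     [0, 1, 0, 1]]

codewords = [encode([a, b], G) for a in (0, 1) for b in (0, 1)]

# the code contains the all-zero word, and the component-wise mod-2 sum
# of any two codewords is again a codeword (linearity)
assert [0, 0, 0, 0] in codewords
for c1 in codewords:
    for c2 in codewords:
        assert [(x + y) % 2 for x, y in zip(c1, c2)] in codewords
```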
Another way to specify an [n, k] code C—and the one we will use to define low-density parity-check codes—is to view C as the solution space of some homogeneous system of linear equations in n variables X1, X2, . . . , Xn. Since, in F2, there are only two possible scalars, the structure of any one such equation is exceedingly simple: it is always of the form

X_{i1} + X_{i2} + · · · + X_{im} = 0,    (10.1)

where {i1 , i2 , . . . , im } is some subset of {1, 2, . . . , n}. Such an equation is sometimes referred to as a parity check, since it speciﬁes that the “parity” (the number of ones) in the subset of the variables indexed by {i1 , . . . , im } should be even, i.e., (10.1) is satisﬁed if and only if an even number of Xi1 , Xi2 , . . . Xim take value one. If we deﬁne the n-tuple h as the vector with value one in components i1 , i2 , . . . , im , and value zero in all other components, then (10.1) may also be written as (X1 , . . . , Xn )hT = 0, where hT denotes the “transpose” of h. Given a system of (n − k) parity-check equations, we may collect the corresponding h-vectors to form the rows of an (n−k)×n matrix H, called a parity-check matrix. The set C = {x : xH T = 0}

of all possible solutions to this system of equations, i.e., the set of vectors that satisfy all parity checks, is then a code of length n and dimension at least k (and possibly more, if the rows of H are not linearly independent). Thus a code C may be specified by providing a parity-check matrix for C. Note that, whereas a generator matrix for C gives us a convenient encoder, a parity-check matrix H for C gives us a convenient means of testing a vector for membership in the code, since a given vector r ∈ F2^n is a codeword if and only if rH^T = 0. More generally, parity-check matrices are useful for decoding since, if r is a non-codeword, it is the structure of parity-check failures in the so-called syndrome rH^T that provides evidence about which bits of r need to be changed in order to recover a valid codeword. Low-density parity-check (LDPC) codes have the special property that they are defined via a parity-check matrix that is sparse, i.e., by an H matrix that has only a small number of nonzero entries. If the H matrix has a fixed number of ones in each row and a fixed number of ones in each column, then the corresponding code is called a regular LDPC code; otherwise, the code is an irregular code. As an example, one of the earliest families of LDPC codes—the so-called (3,6)-regular LDPC codes, defined by R. G. Gallager at MIT in the early 1960s (Gallager, 1963)—have an H matrix with exactly 6 ones in each row and 3 ones in each column. If we take n reasonably large, say n = 2000, the matrix contains just 3n = 6,000 ones, whereas a binary matrix of the same size, generated by flipping a fair coin for each matrix entry, would on average contain one million ones. Thus we see that the H matrix is very sparse indeed. Why is sparseness important? The answer lies in the nature of the decoding algorithm, which we describe next.
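Membership testing via the syndrome rH^T can be sketched in a few lines. The parity-check matrix below is a small hypothetical example (its two rows encode the checks X1 + X3 = 0 and X1 + X2 + X4 = 0, chosen only for illustration):

```python
def syndrome(r, H):
    # s = r H^T over F2; r is a codeword iff every component of s is zero
    return [sum(rj * hj for rj, hj in zip(r, row)) % 2 for row in H]

# hypothetical parity-check matrix: checks X1 + X3 = 0 and X1 + X2 + X4 = 0
H = [[1, 0, 1, 0],
     [1, 1, 0, 1]]

assert syndrome([1, 0, 1, 1], H) == [0, 0]   # satisfies both checks
assert syndrome([1, 0, 0, 1], H) != [0, 0]   # a flipped bit fails a check
```

Note how the nonzero syndrome in the second case points at which check failed, which is exactly the evidence a decoder exploits.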

10.3 Message-Passing Decoding of LDPC Codes

10.3.1 From Codes to Graphs

The relationship between variables and equations can be visualized using a graph, such as the one shown in Figure 10.1. This graph (called a Forney-style factor graph (Forney, 2001; Loeliger, 2004)) consists of various vertices and edges (as is conventional in graph theory), and (somewhat unconventionally) also includes a number of “half-edges,” which are edges incident on a single vertex only. The half-edges are denoted as ‘⊥’ in Figure 10.1. Edges and half-edges represent binary variables. A “conﬁguration” is an assignment of a binary value to each edge and half-edge. Certain conﬁgurations will be regarded as “valid conﬁgurations,” and all others as invalid. Vertices in the graph represent “local constraints” that the variables must satisfy in order to form a valid conﬁguration. So-called “equality constraints” (or “equality vertices”), denoted with ‘=’ in Figure 10.1, constrain all neighboring variables (i.e., all incident edges) to take on the same value in every valid conﬁguration, whereas so-called

Figure 10.1 A factor graph for a (very small) (3,6)-regular LDPC code. Each edge (and half-edge) represents a binary variable. The boxes labeled = are equality constraints that enforce the rule that all incident edges are to have the same value in every valid configuration. The boxes labeled + are parity-check constraints that enforce the rule that all incident edges are to have even parity (zero-sum, modulo two).

“parity-check constraints” (or “check vertices”), denoted with ‘+’ in Figure 10.1, constrain the neighboring variables to form a configuration having an even number of ones, i.e., a modulo-two sum of zero. The valid configurations are precisely those that satisfy all local constraints. The half-edges represent the codeword symbols v1, v2, . . . , v18. Half-edges can be viewed as the “interface” (or “read-out”) between the configuration space induced by the internal structure of the graph, and the desired “external” behavior. Equivalently, the full edges in the graph may be regarded as hidden (or “auxiliary” or “state”) variables, and the half-edges as observed or primary variables. The equality constraints essentially serve to “copy” the value of each codeword symbol in a valid configuration to the neighboring (full) edges. Each of the parity-check constraints implements a single parity-check equation; for example, the highlighted parity-check constraint in Figure 10.1 essentially implements the equation v3 + v6 + v9 + v12 + v15 + v18 = 0; however, instead of involving the variables v3, v6, etc., directly, the highlighted parity-check constraint vertex involves copies of these variables. It should now be clear that the set of valid configurations projected on the half-edges in Figure 10.1 forms a binary linear code satisfying the 9 different parity-check equations implemented by the check vertices. Indeed, as the reader may verify by

tracing edges, this code is defined by the parity-check matrix

        ⎡ 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ⎤
        ⎢ 1 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 0 0 ⎥
        ⎢ 0 1 0 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 ⎥
        ⎢ 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 ⎥
    H = ⎢ 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 ⎥
        ⎢ 1 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 ⎥
        ⎢ 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 ⎥
        ⎢ 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 1 ⎥
        ⎣ 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 ⎦
It is clear that H encodes the incidence structure of the graph, with an edge corresponding to each nonzero entry of H. If a nonzero entry occurs in row i and column j, then the edge connects the ith check vertex with the jth equality vertex. Because of the correspondence between H and the factor graph, sparseness of H implies sparseness of the graph and vice versa.

10.3.2 Channel Models

As noted above, communication channels are typically modeled in discrete time: a channel accepts a channel input xi and produces a corresponding channel output yi, according to some probabilistic model. For example, the binary symmetric channel described earlier accepts, at time i, a binary digit xi ∈ {0, 1} at its input, and produces a binary digit yi ∈ {0, 1} at its output, with the property that yi = xi with probability 1 − p (and therefore yi ≠ xi with probability p). The parameter p is called the cross-over probability of the channel. The binary symmetric channel is assumed to be memoryless, which means that, given the ith channel input xi, the channel output yi is independent of all other channel inputs and outputs, i.e., (assuming n channel inputs and outputs in total) p(yi | x1, . . . , xn, y1, . . . , yi−1, yi+1, . . . , yn) = p(yi | xi). The capacity C(p) of the binary symmetric channel with crossover probability p is given by (Shannon, 1948) C(p) = 1 − H(p) bits/symbol, where H(p) denotes the binary entropy function H(p) = −p log2 p − (1 − p) log2 (1 − p). The capacity is plotted in Figure 10.2(a). We will also consider the binary-input additive white Gaussian noise (AWGN) channel, which at time i accepts a channel input xi ∈ {−1, 1}, and produces the (real-valued) output yi = xi + ni, where ni is a zero-mean Gaussian random variable with variance σ². This channel is also assumed to be memoryless, and serves as a model of the widely implemented continuous-time transmission schemes known as

binary phase-shift keying (BPSK) and quadrature phase-shift keying (QPSK). The capacity C(σ) of the binary-input additive white Gaussian noise channel with noise variance σ² is given by

C(σ) = 1 − (1/√(2π)) ∫_{−∞}^{∞} exp(−u²/2) log₂(1 + exp(−2(σu + 1)/σ²)) du bits/symbol.

This function is plotted in Figure 10.2(b). Also plotted is the function (1/2) log₂(1 + 1/σ²), which represents the capacity of an additive white Gaussian noise channel in which the channel input is constrained to have unit second moment, but is unconstrained in value. Instead of using the noise variance σ as a parameter of an AWGN channel, one often encounters the so-called “bit-energy to noise-density ratio,” denoted Eb/N0. This terminology arises in the context of continuous-time AWGN channels, in which

Figure 10.2 Capacity as a function of channel parameter: (a) the binary symmetric channel with crossover probability p; (b) the binary-input additive white Gaussian noise channel with noise variance σ².

the one-sided noise power spectral density is typically parameterized by the value N0. For a code of rate R and a noise variance of σ², we have

Eb/N0 = 1/(2Rσ²).

Often the value of Eb/N0 is quoted as a value in decibels (dB), i.e., the value quoted is 10 log10(Eb/N0). Thus, for example, a code of rate R operating in an AWGN channel with an Eb/N0 of x dB is operating in a channel of noise variance

σ² = 10^(−x/10) / (2R).
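The capacity formulas above (for the BSC and the binary-input AWGN channel) and the Eb/N0-to-σ² conversion can all be evaluated numerically. The sketch below approximates the Gaussian integral by a midpoint Riemann sum over a truncated range, which we assume is adequate for plotting purposes:

```python
import math

def bsc_capacity(p):
    # C(p) = 1 - H(p) for the binary symmetric channel
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def biawgn_capacity(sigma, lo=-8.0, hi=8.0, steps=4000):
    # C(sigma) = 1 - (1/sqrt(2*pi)) * integral of
    #   exp(-u^2/2) * log2(1 + exp(-2*(sigma*u + 1)/sigma**2)) du
    du = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        u = lo + (k + 0.5) * du
        total += math.exp(-u * u / 2) * math.log2(
            1.0 + math.exp(-2.0 * (sigma * u + 1.0) / sigma ** 2)) * du
    return 1.0 - total / math.sqrt(2 * math.pi)

def sigma_squared(ebn0_db, rate):
    # sigma^2 = 10^(-x/10) / (2R) for an Eb/N0 of x dB and code rate R
    return 10.0 ** (-ebn0_db / 10.0) / (2.0 * rate)

# capacity shrinks as the channel gets noisier
assert bsc_capacity(0.0) == 1.0 and abs(bsc_capacity(0.5)) < 1e-12
assert biawgn_capacity(0.5) > biawgn_capacity(1.0) > biawgn_capacity(2.0)
```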

In the coding literature one also often encounters the term “Shannon limit.” The Shannon limit for a code of rate R is the value of the channel parameter corresponding to the worst channel that (in principle) could be used with a code of that rate, i.e., the channel parameter for which the capacity of the channel is R. In the case of the AWGN channel, the performance of a coding scheme is often quoted in terms of a distance (in dB) from the Shannon limit. Thus, if the code achieves acceptable performance at a noise variance σ², but the corresponding Shannon limit is σ0² > σ², then the distance to the Shannon limit is given as 10 log10(σ0²/σ²) dB.

10.3.3 From Graphs to Decoding Algorithms

Most decoding algorithms for low-density parity-check codes operate by passing “messages” along the edges of the graph describing the code. We will begin by providing an intuitive description of this process. Initially, messages are derived from the channel outputs. Each channel output is translated into a “belief” about the value of the corresponding codeword symbol, where a “belief” is a guess about the value (zero or one) along with a measure of confidence in that guess. Unfortunately, due to channel noise, some of these beliefs are, in fact, erroneous. Beliefs about each codeword symbol are communicated to the check vertices. By enforcing the rule that in a valid configuration the modulo-two sum of the bit values is zero, the checks can update the beliefs. For example, if the beliefs received at a check form a configuration that does not satisfy the zero-sum rule, then the bit with the least confidence could be informed that it should probably alter its belief. The process of sending messages from equality vertices to check vertices and back again is called an “iteration,” and after several iterations (depending on the code and the noise in the channel), the beliefs about the symbols, with high probability, reflect the transmitted configuration. Now it becomes clear why sparseness in the graph is important. Firstly, the total amount of computation required per iteration is proportional to the number of edges in the decoding graph, and this number is exactly equal to the number of nonzero entries in the H matrix. Thus, the sparser the graph, the smaller the decoding complexity. Secondly, sparseness helps to make it difficult for short cycles in the graph (which can sometimes cause reinforcement of erroneous beliefs)

to influence the decoder unduly. We also notice that this decoding algorithm naturally supports parallelism, as the message transfers between checks and variables can, in principle, all occur simultaneously. This observation is the foundation for a number of hardware implementations of LDPC decoders, in which processing nodes correspond directly to factor-graph vertices, and wires connecting these nodes correspond directly to factor-graph edges. We will now give a more precise description of message-passing decoding, starting with the so-called sum-product algorithm. See Kschischang et al. (2001) for more details. Messages passed on an edge during decoding are probability mass functions for the corresponding binary variable. A probability mass function p(x) for a binary variable X can be encoded with just a single parameter (e.g., p(0), p(1), p(0) − p(1), p(0)/p(1), etc.), and hence messages are real-valued scalars. A very commonly used parametrization is the log-likelihood ratio (LLR), defined as ln(p(0)/p(1)). Note that the sign of an LLR value indicates which symbol value (0 or 1) is more likely, and so can be used to make a decision on the value of that symbol. The magnitude of the LLR can be interpreted as a measure of confidence in the decision; a large magnitude indicates a large disparity between p(0) and p(1), and hence a greater confidence in the truth of the decision. The “neutral message” corresponding to the uniform distribution will be denoted as μ0. If an LLR representation is used, then μ0 = 0. Messages are always directed (along an edge or half-edge) in the graph. A message on a half-edge directed to a vertex v will be denoted as μ→v, and a message on a full edge directed from a vertex v1 to a vertex v2 will be denoted as μv1→v2. We will denote the set of neighbors of a vertex v1 as N(v1). If v2 ∈ N(v1), then N(v1) \ {v2} is the set of neighbors of v1 excluding v2.
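The LLR representation and its use for decisions can be made concrete with two tiny helper functions (a sketch; the function names are ours):

```python
import math

def llr(p0, p1):
    # log-likelihood ratio ln(p(0)/p(1)) of a binary mass function
    return math.log(p0 / p1)

def decide(l):
    # the sign of the LLR gives the decision; its magnitude, the confidence
    return 0 if l >= 0 else 1

assert decide(llr(0.9, 0.1)) == 0   # 0 is more likely
assert decide(llr(0.2, 0.8)) == 1   # 1 is more likely
assert llr(0.5, 0.5) == 0.0         # the neutral message mu_0
```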
Initialization: The decoding algorithm is initialized by sending neutral messages on all edges, i.e., μv1→v2 = μv2→v1 = μ0 for every pair of vertices v1, v2 connected by an edge. The half-edges are initialized with so-called "intrinsic" or "channel" messages, corresponding to the received channel output. In particular, for a binary symmetric channel with crossover probability p, the LLR associated with channel output y ∈ {0, 1} is given by

λ(y) = (−1)^y ln((1 − p)/p),

and this is the initial message sent toward the corresponding equality vertex in the graph. Similarly, for the binary-input AWGN channel with noise variance σ², the LLR associated with channel output y ∈ R is given by

λ(y) = 2y/σ²,

10.3 Message-Passing Decoding of LDPC Codes

assuming that transmission of a zero corresponds to the +1 channel input, and transmission of a one corresponds to the −1 channel input.

Local Updates: Messages are updated at the vertices according to the principle that the message μv1→v2 sent from vertex v1 to its neighbor v2 is a function of the messages directed toward v1 on all edges other than the edge {v1, v2}. This principle is one of the pillars that leads to an analysis of the decoder; furthermore, this principle leads to optimum decoding in a cycle-free graph (see Kschischang et al., 2001). Assuming that messages are represented as LLR values, the message sent by the sum-product algorithm from an equality vertex v1 to a neighboring check vertex v2 is given as

μv1→v2 = μ→v1 + Σ_{v′ ∈ N(v1)\{v2}} μv′→v1,   (10.2)

where μ→v1 denotes the channel message received along the half-edge connected to v1. Similarly, the message sent from a check vertex v2 to a neighboring equality vertex v1 is given as

μv2→v1 = 2 tanh⁻¹( Π_{v′ ∈ N(v2)\{v1}} tanh(μv′→v2 / 2) ).   (10.3)

The hyperbolic tangent functions involved in this update rule are actually performing a change of message representation: if λ is the LLR value ln(p(0)/p(1)), then tanh(λ/2) = p(0) − p(1), the probability difference. The product of tanh(·) functions in (10.3) thus implements a product of probability differences, which can itself be seen as the difference between the probabilities of local configurations having even parity and those having odd parity. Messages received on full edges at an equality vertex are referred to as "extrinsic" messages. Extrinsic messages, unlike the "intrinsic" channel messages, reflect the structure of the code and change from iteration to iteration; when decoding is successful, the quality (magnitude, in an LLR implementation) of the extrinsic messages improves from iteration to iteration. In this way, the extrinsic messages can "overwhelm" erroneous channel messages, leading to successful decoding.

Update Schedule: The order in which messages are updated is referred to as the "update schedule." A commonly used schedule is to send messages from each equality vertex toward the neighboring check vertices (in any order), and then to send messages in the opposite direction (in any order). One complete such update is referred to as an "iteration," and usually many iterations are performed before a decoding decision is reached. Other update schedules can lead to faster convergence (Sharon et al., 2004; Xiao and Banihashemi, 2004), but we will not consider these here.
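The two local computations, together with the channel-LLR initializations given above, can be sketched in a few lines. This is our own illustration of rules (10.2) and (10.3) on scalar LLR messages, not code from the chapter:

```python
import math

def channel_llr_bsc(y, p):
    """Intrinsic LLR for a BSC with crossover probability p: (-1)^y ln((1-p)/p)."""
    return (-1) ** y * math.log((1.0 - p) / p)

def channel_llr_awgn(y, sigma2):
    """Intrinsic LLR for the binary-input AWGN channel: 2y / sigma^2."""
    return 2.0 * y / sigma2

def equality_update(channel_msg, extrinsic):
    """(10.2): sum of the channel message and all other incoming messages."""
    return channel_msg + sum(extrinsic)

def check_update(incoming):
    """(10.3): 2 atanh of the product of tanh(m/2) over the other incoming messages."""
    prod = 1.0
    for m in incoming:
        prod *= math.tanh(m / 2.0)
    return 2.0 * math.atanh(prod)
```

As expected from the parity interpretation, the sign of the check output is the product of the incoming signs, and its magnitude never exceeds that of the least reliable incoming message.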


Termination: Decoding decisions are based on all of the messages (both intrinsic and extrinsic) directed toward each equality vertex v. Decisions are made on a symbol-by-symbol basis. With LLR messages, the decision statistic is given as

μ = μ→v + Σ_{v′ ∈ N(v)} μv′→v.

If μ > 0, then the corresponding codeword symbol value is chosen to be 0; otherwise it is chosen to be 1. Usually, iterations are performed until these decisions yield a valid codeword, or until the number of iterations reaches some allowed maximum. Note that the total computational complexity (assuming a fixed maximum number of iterations and a fixed distribution of vertex degrees) scales linearly with the block length of the code. In addition to the sum-product algorithm, a number of other (often simpler) message-passing algorithms have been studied. These include the "min-sum" algorithm and Gallager's "decoding algorithm B," which are described next.

Min-sum algorithm: In the min-sum algorithm, the update rule at an equality vertex is the same as in the sum-product algorithm (10.2), but the update rule at a check vertex v2 is simplified to

μv2→v1 = ( min_{v′ ∈ N(v2)\{v1}} |μv′→v2| ) · Π_{v′ ∈ N(v2)\{v1}} sign(μv′→v2).   (10.4)

Notice that the tanh⁻¹ of the product of tanh's is approximated by the minimum of the absolute values times the product of the signs. This approximation becomes more accurate as the magnitude of the messages increases.

Gallager's decoding algorithm B: In this algorithm, introduced by Gallager (1963), the message alphabet is {0, 1}. In other words, the messages communicate "decisions" only, without an associated reliability. The update rule at a check vertex v2 is

μv2→v1 = ⊕_{v′ ∈ N(v2)\{v1}} μv′→v2,   (10.5)

where ⊕ represents the modulo-two sum of binary messages. At an equality vertex v1 of degree dv + 1, the outgoing message μv1→v2 is

μv1→v2 = { μ̄→v1, if there exist v′1, ..., v′b ∈ N(v1)\{v2} such that μv′1→v1 = ··· = μv′b→v1 = μ̄→v1;
           μ→v1, otherwise,   (10.6)

where μ̄ denotes the complement of the binary message μ, and b is an integer in the range ⌈(dv − 1)/2⌉ < b < dv. Here, the outgoing message of an equality vertex is the same as the intrinsic message, unless at least b of the extrinsic messages disagree with it. The value of b may change from one iteration to another. The optimum value of b for a (dv, dc)-regular LDPC code (i.e., a code with an H-matrix having dc ones in every row and dv ones in every column) was computed by Gallager


(1963) and is the smallest integer b for which

(1 − p)/p ≤ [ (1 + (1 − 2pe)^(dc−1)) / (1 − (1 − 2pe)^(dc−1)) ]^(2b − dv + 1),   (10.7)

where p and pe are the channel crossover probability (intrinsic message error rate) and the extrinsic message error rate, respectively. It can be proved that Algorithm B is the best possible binary message-passing algorithm for regular LDPC codes.
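Inequality (10.7) can be applied directly. The short sketch below is our own illustration of how Gallager's optimum threshold b is found for given intrinsic and extrinsic error rates:

```python
def optimum_b(p, pe, dv, dc):
    """Smallest integer b satisfying (10.7) for a (dv, dc)-regular LDPC code.

    p  : channel crossover probability (intrinsic message error rate)
    pe : extrinsic message error rate at the current iteration (pe < 1/2)
    """
    x = (1.0 - 2.0 * pe) ** (dc - 1)
    ratio = (1.0 + x) / (1.0 - x)       # > 1, so the loop terminates
    b = 1
    while (1.0 - p) / p > ratio ** (2 * b - dv + 1):
        b += 1
    return b

# At the first iteration the extrinsic error rate equals the channel error rate,
# so pe = p; for a (3,6)-regular code with p = 0.02 this gives b = 2.
```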

10.4 LDPC Decoder Analysis

10.4.1 Decoding Threshold

For a binary symmetric channel with parameter p < 1/2 and an AWGN channel with parameter σ, the performance of an iterative decoder degrades with increasing channel parameter. Richardson and Urbanke (2001) studied the performance of families of low-density parity-check codes with a fixed proportion of check and equality vertices of each degree. In the limit as the block length goes to infinity (so that the neighbors, next-neighbors, next-next-neighbors, etc., of each vertex, taken to a particular depth, can be assumed to form a tree), they show that the family exhibits a threshold phenomenon: there is a "worst channel" for which (almost all) members of the family have a vanishing error probability as the block length and number of iterations go to infinity. This channel condition is called the threshold of the code family. For example, the threshold of the family of (3,6)-regular codes on the AWGN channel under sum-product decoding is 1.1015 dB, which means that if an infinitely long (3,6)-regular code were used on an AWGN channel, convergence to zero error rate is almost surely guaranteed whenever Eb/N0 is greater than 1.1015 dB. If the channel condition is worse than the threshold, a nonzero error rate is assured. In practice, when finite-length codes are used, there is a gap between the Eb/N0 required to achieve a certain (small) target error probability and the threshold associated with the given family, but this gap shrinks as the code length increases. The main aim in the asymptotic analysis of families of LDPC codes is to determine the threshold associated with the family, and one of the aims of code design is to choose the parameters of the family so that the threshold approaches channel capacity.

10.4.2 Extrinsic Information Transfer (EXIT) Charts

An iterative decoder can be thought of as a “black box” that at each iteration takes two sources of knowledge about the transmitted codeword—the intrinsic information and the extrinsic information—and attempts to obtain an “improved” knowledge about the transmitted codeword. The “improved” knowledge is then used


as the extrinsic information for the next iteration. When decoding is successful, the extrinsic information gets better and better as the decoder iterates. Therefore, all methods of analysis of iterative decoders study statistics of the extrinsic messages at each iteration. For example, one might study the evolution of the entire probability density function (pdf) of the extrinsic messages from iteration to iteration. This is the most complete (and probably most complex) analysis, and is known as density evolution (Richardson and Urbanke, 2001). However, as an approximate analysis, one may study the evolution of a representative or an approximate parametrization of the true density. An example of this approach is the use of so-called "extrinsic information transfer" (EXIT) charts (Divsalar et al., 2000; El Gamal and Hammons, 2001; ten Brink, 2000, 2001). In EXIT-chart analysis, instead of tracking the density of messages, one tracks the evolution of a single parameter—a measure of the decoder's success—iteration by iteration. For example, one might track the "signal-to-noise ratio" of the extrinsic messages (Divsalar et al., 2000; El Gamal and Hammons, 2001), their error probability (Ardakani and Kschischang, 2004), or the mutual information between messages and decoded bits (ten Brink, 2000). Initially, the term "EXIT chart" was used when tracking mutual information; however, the use of this term was generalized in (Ardakani and Kschischang, 2004) to the tracking of other parameters as well. Let s denote the message parameter being tracked, and let s0 denote the parameter associated with the channel messages. If si denotes the parameter associated with the extrinsic messages at the input of the ith iteration, then an EXIT chart is the function f(si, s0) that gives the value of the message parameter si+1 at the output of the ith iteration, i.e., we have si+1 = f(si, s0).
In the remainder of this chapter, we use EXIT charts based on tracking the message error rate, as we find them most useful for our applications. Thus we track the proportion of messages that give the "wrong" value for the corresponding symbol. If pin denotes this proportion at the input of an iteration, and pout denotes this proportion at the output of an iteration, then we have pout = f(pin, p0), where p0 denotes the proportion of "wrong" channel messages. For a fixed p0, this function can be plotted in pin-pout coordinates. Usually EXIT charts are presented by plotting both f and its inverse f⁻¹, as this makes the visualization of the decoder easier. Figure 10.3 shows the concept. As can be seen from the figure, decoder progress can be visualized as a series of steps, shuttling between f and f⁻¹ (as pout of one iteration becomes pin of the next). Using EXIT charts, one can thus study how many iterations are required to achieve a target message error rate.
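The staircase trajectory is easy to reproduce for a case where the chart is known in closed form. On the binary erasure channel, a (dv, dc)-regular code has the chart f(pin, p0) = p0(1 − (1 − pin)^(dc−1))^(dv−1); the sketch below (our own illustration of the shuttling procedure, not taken from the chapter) counts how many iterations a (3,6)-regular code needs to reach a target message error rate:

```python
def exit_chart_bec(p_in, p0, dv=3, dc=6):
    """EXIT chart of a (dv, dc)-regular LDPC code on the binary erasure channel."""
    return p0 * (1.0 - (1.0 - p_in) ** (dc - 1)) ** (dv - 1)

def iterations_to_target(p0, target, max_iter=1000):
    """Shuttle p_out of one iteration into p_in of the next, as in Figure 10.3."""
    p = p0
    for it in range(1, max_iter + 1):
        p = exit_chart_bec(p, p0)
        if p < target:
            return it
    return None  # tunnel closed (or too tight): target not reached

# Below the (3,6) erasure-channel threshold (about 0.4294) the decoder converges;
# above it, the message error rate stalls at a nonzero fixed point.
```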

[Figure 10.3 here: an EXIT chart for a regular (3,6) code, plotting pout versus pin on axes from 0 to 0.12, with the predicted decoding trajectory and actual results stepping through iterations 1–4.]

Figure 10.3  An EXIT chart based on message error rate. Simulation results are for a randomly generated (3,6)-regular code of block length 200,000 on an AWGN channel with Eb/N0 = 1.75 dB.

The region of the graph between f and f⁻¹ is referred to as the "decoding tunnel." If the decoding tunnel of an EXIT chart is closed, i.e., if f and f⁻¹ cross at some large pin, so that for some pin we have pout > pin, successful decoding to a small error probability does not occur. In such cases we say that the EXIT chart is closed (otherwise, it is open). An open EXIT chart always lies below the 45-degree line pout = pin. As p0 gets worse (i.e., as the channel degrades), the decoding tunnel becomes tighter and tighter, and hence the decoder requires more and more iterations to converge to a target error probability. Eventually, when p0 is bad enough, the tunnel closes completely. This condition gives an estimate of the code threshold, i.e., we may estimate the threshold p0* as the worst channel condition for which the tunnel is open by defining

p0* = sup { p0 : f(pin, p0) < pin for all 0 < pin ≤ p0 }.
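This supremum can be estimated numerically whenever f is computable. As an illustration (ours, not from the chapter), on the binary erasure channel a (3,6)-regular code has the closed-form chart f(pin, p0) = p0(1 − (1 − pin)^5)², and a simple scan recovers its well-known threshold of about 0.429:

```python
def f_bec(p_in, p0, dv=3, dc=6):
    """Closed-form EXIT chart of a (dv, dc)-regular code on the erasure channel."""
    return p0 * (1.0 - (1.0 - p_in) ** (dc - 1)) ** (dv - 1)

def tunnel_open(p0, grid=800):
    """Check f(p_in, p0) < p_in on a grid of p_in values in (0, p0]."""
    return all(f_bec(p0 * k / grid, p0) < p0 * k / grid
               for k in range(1, grid + 1))

def threshold(lo=0.0, hi=1.0, steps=500):
    """Largest scanned p0 for which the decoding tunnel is still open."""
    best = lo
    for k in range(1, steps + 1):
        p0 = lo + (hi - lo) * k / steps
        if tunnel_open(p0):
            best = p0
    return best
```

A finer grid (or a bisection) would sharpen the estimate; the scan here is deliberately coarse.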

EXIT chart analysis is not as accurate as density evolution, because it tracks just a single parameter as the representative of a pdf. For many applications, however, EXIT charts are very accurate. For instance, in (ten Brink, 2000, 2001), EXIT charts are used to approximate the behavior of iterative turbo decoders on a Gaussian channel very accurately. In (Ardakani and Kschischang, 2004) it is shown that, using EXIT charts, the threshold of convergence for LDPC codes on the AWGN channel can be approximated to within a few thousandths of a dB of the actual value.


In the next section we show that EXIT charts can be used to design irregular LDPC codes which perform no more than a few hundredths of a dB worse than those designed by density evolution. One should also notice that when the pdf of messages can truly be described by a single parameter, e.g., on the so-called "binary erasure channel," EXIT chart analysis is equivalent to density evolution.

10.4.3 Gaussian Approximations

There have been a number of approaches to one-dimensional analysis of sum-product decoding of LDPC codes on the AWGN channel (Ardakani and Kschischang, 2004; Chung et al., 2001; Divsalar et al., 2000; Lehmann and Maggio, 2002; ten Brink and Kramer, 2003; ten Brink et al., 2004), all of them based on the observation that the pdf of the decoder's LLR messages is approximately Gaussian. This approximation is quite accurate for messages sent from equality vertices, but less so for messages sent from check vertices. In this subsection we describe an accurate one-dimensional analysis for LDPC codes based on a Gaussian assumption only for the messages sent from the equality vertices. Because AWGN channels and binary symmetric channels treat 0's and 1's symmetrically, and because LDPC codes are binary linear codes, the behavior of a sum-product decoder is independent of which codeword was transmitted (Richardson and Urbanke, 2001). In the analysis of a decoder, we may therefore assume that the all-zero codeword (equivalent to the all-{+1} channel word) is transmitted. A probability density function f(x) is called symmetric if f(x) = e^x f(−x). In (Richardson and Urbanke, 2001) it has been shown that if the LLR density of the channel messages is symmetric, then all messages sent in sum-product decoding are symmetric. A Gaussian pdf with mean m and variance σ² is symmetric if and only if σ² = 2m. As a result, a symmetric Gaussian density can be expressed by a single parameter. Under the assumption that the all-zero codeword was transmitted over an AWGN channel, the intrinsic LLR messages have a symmetric Gaussian density with a mean of 2/σ² and a variance of 4/σ², where σ² is the variance of the Gaussian channel noise. It follows that under sum-product decoding, all messages remain symmetric.
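Two facts used here — that a Gaussian satisfies the symmetry condition f(x) = e^x f(−x) exactly when σ² = 2m, and that the negative tail of a symmetric Gaussian LLR density gives the message error rate — can be verified numerically. The sketch below is our own illustration:

```python
import math

def gauss_pdf(x, m, var):
    """Gaussian density with mean m and variance var."""
    return math.exp(-(x - m) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def q_func(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def error_rate_symmetric_gaussian(m):
    """P(LLR < 0) for a symmetric Gaussian LLR density N(m, 2m) = Q(sqrt(m/2))."""
    return q_func(m / math.sqrt(2.0 * m))

# Intrinsic LLRs on an AWGN channel with noise variance sigma^2 have
# mean m = 2/sigma^2 and variance 4/sigma^2 = 2m, hence are symmetric;
# for sigma = 1 the channel-message error rate is Q(1), about 0.159.
```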
In addition, since the update rule at the equality vertices is the summation of incoming messages, according to the central limit theorem, the density of the messages at the output of equality vertices tends to be Gaussian, so it seems sensible to approximate them with a symmetric Gaussian. To avoid a Gaussian assumption on the output of check vertices, we consider one whole iteration at once. That is to say, we study the input-output behavior of the decoder from the input of the iteration (messages from equality vertices to check vertices) to the output of that iteration (messages from equality vertices to check vertices). Figure 10.4 illustrates the idea. In every iteration we assume that the input and the output messages shown in Figure 10.4, which are outputs of equality vertices, are symmetric Gaussian. We start with Gaussian distributed messages at


[Figure 10.4 here: a depth-one tree in which messages from the previous iteration and a channel message enter at the leaves, and the root message forms the input to the next iteration.]

Figure 10.4  A depth-one tree for a (3,6)-regular LDPC code.

the input of the iteration and compute the pdf of messages at the output. This can be done by "one-step" density evolution. Then we approximate the actual output pdf with a symmetric Gaussian. Since we assume that the all-zero codeword is transmitted, the negative tail of this density reflects the message error rate. As a result, we can track the evolution of the message error rate and represent it in an EXIT chart. This technique led to the results shown in Figure 10.3, showing the close agreement between simulation results and the decoding behavior predicted by the EXIT-chart analysis. We refer to the method of approximating only the output of the equality vertices with a Gaussian (but not the output of the check vertices) as the "semi-Gaussian" approximation.

10.4.4 Analysis of Irregular LDPC Codes

A single depth-one tree cannot be deﬁned for irregular codes, since not all vertices (even of the same type) have the same degree. If the check degree distribution is ﬁxed, each equality vertex of a ﬁxed degree gives rise to its own depth-one tree. Irregularity in the check vertices is taken into account in these depth-one trees. For any ﬁxed check degree distribution, we refer to the depth-one tree associated with a degree i equality vertex as the “degree i depth-one tree.” For reasons similar to the case of regular codes, we assume that at the output of any depth-one tree, the pdf of LLR messages is well-approximated by a symmetric Gaussian. As a result, the pdf of LLR messages at the input of check vertices can be approximated as a mixture of symmetric Gaussian densities. The weights of this mixture are determined by the proportion of equality vertices of each degree. Nevertheless, at the output of a given equality vertex, the distribution is still close to Gaussian and so the semi-Gaussian method can be used to ﬁnd the EXIT charts corresponding to the equality vertices of diﬀerent degrees. In other words, for any i, using the “degree i depth-one tree,” an EXIT chart associated with equality vertices of degree i can be found. We call such EXIT charts elementary EXIT charts. We


have pout,i = fi(pin, p0), where pout,i is the message error rate at the output of degree-i equality vertices. Now, using Bayes' rule, pout for the mixture of all equality vertices can be computed as

pout = Σ_{i≥2} Pr(degree = i) Pr(error | degree = i) = Σ_{i≥2} λi fi(pin, p0),   (10.8)

where λi denotes the proportion of edges incident on equality vertices of degree i. Thus we obtain the important result that the overall EXIT chart can be obtained as a weighted linear combination of elementary EXIT charts. A similar formulation can be used when the mean of the messages is the parameter tracked by the EXIT chart (Chung et al., 2001). It has been shown in (Tuechler and Hagenauer, 2002) that when the messages have a symmetric pdf, mutual-information also combines linearly to form the overall mutual-information.
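Equation (10.8) is easy to check in a setting where the elementary charts are known exactly. On the binary erasure channel, a degree-i equality vertex has fi(pin, p0) = p0(1 − (1 − pin)^(dc−1))^(i−1); the sketch below (our own illustration, using a made-up degree distribution) combines elementary charts linearly:

```python
def elementary_exit_bec(p_in, p0, i, dc=6):
    """Elementary EXIT chart of a degree-i equality vertex on the erasure channel."""
    return p0 * (1.0 - (1.0 - p_in) ** (dc - 1)) ** (i - 1)

def mixture_exit(p_in, p0, lambdas, dc=6):
    """(10.8): weighted linear combination of elementary EXIT charts.

    lambdas maps degree i to lambda_i, the fraction of edges incident on
    degree-i equality vertices (hypothetical values, for illustration only).
    """
    return sum(l * elementary_exit_bec(p_in, p0, i, dc)
               for i, l in lambdas.items())

lambdas = {2: 0.5, 3: 0.5}          # made-up degree distribution
p_out = mixture_exit(0.3, 0.4, lambdas)
```

The overall chart always lies between the best and worst elementary charts, weighted by the edge proportions.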

10.5 Design of Irregular LDPC Codes

We now describe how this EXIT-chart framework may be used to design irregular LDPC codes. In this framework, the design problem can be simplified to a linear program. Let λi denote the proportion of factor-graph edges incident on an equality vertex of degree i, and let ρj denote the proportion of edges incident on a check vertex of degree j. We formulate the design problem for an irregular LDPC code as that of shaping an EXIT chart from a group of elementary EXIT charts (according to (10.8)) so that the rate of the code is maximized, subject to the constraint that the resulting EXIT chart remains open, i.e., so that f(x) < x for all x ∈ (0, p0], where p0 is the initial message error rate at the decoder. It can be shown that the rate of an LDPC code is at least 1 − (Σ_j ρj/j)/(Σ_i λi/i), and hence, for a fixed check degree distribution, the design problem can be formulated as the following linear program:

maximize:  Σ_{i≥2} λi/i
subject to:  λi ≥ 0,  Σ_{i≥2} λi = 1,  and  Σ_{i≥2} λi fi(pin, p0) < pin for all pin ∈ (0, p0].

In the above formulation, we have assumed that the elementary EXIT charts are given. In practice, to find these curves we need to know the degree distribution of the code. We need the degree distribution to associate every input pin with its equivalent input pdf, which is in general assumed to be a Gaussian mixture. In


Table 10.1  A list of irregular codes designed for the AWGN channel by the semi-Gaussian method

Degree sequence   Code 1     Code 2     Code 3     Code 4     Code 5     Code 6
dv1, λdv1         2, .1786   2, .1530   2, .1439   2, .1890   2, .2444   2, .3000
dv2, λdv2         3, .3046   3, .2438   3, .1602   3, .1158   3, .1687   3, .1937
dv3, λdv3         5, .0414   7, .1063   5, .1277   4, .1153   4, .0130   4, .0192
dv4, λdv4         6, .0531   10, .2262  6, .0219   6, .0519   5, .1088   7, .2378
dv5, λdv5         7, .0007   14, .0305  7, .0279   7, .0875   7, .1120   14, .0158
dv6, λdv6         10, .4216  19, .0001  8, .0103   14, .0823  14, .1130  15, .0114
dv7, λdv7         —          23, .1293  12, .1551  15, .0007  15, .0577  20, .0910
dv8, λdv8         —          32, .0736  30, .0004  16, .0001  25, .0063  25, .0002
dv9, λdv9         —          38, .0372  37, .3525  39, .3573  25, .0109  30, .0232
dv10, λdv10       —          —          40, .0001  40, .0001  40, .1652  40, .1077
dc                40         24         22         10         7          5
Threshold σ       0.5072     0.6208     0.6719     0.9700     1.1422     1.5476
Rate              0.9001     0.7984     0.7506     0.4954     0.3949     0.2403
Gap to Shannon
limit (dB)        0.1308     0.0630     0.0666     0.1331     0.1255     0.2160
other words, prior to the design, the degree distribution is not known and as a result, we cannot ﬁnd the elementary EXIT charts to solve the linear program above. To solve this problem, we suggest a recursive solution. At ﬁrst we assume that the input message to the iteration has a single symmetric Gaussian density instead of a Gaussian mixture. Using this assumption, we can map every message error rate at the input of the iteration to a unique input pdf and so ﬁnd fi curves for diﬀerent i. (It is interesting that even with this assumption the error in approximating the threshold of convergence, based on our observations, is less than 0.3 dB and the codes which are designed have a convergence threshold of at most 0.4 dB worse than those designed by density evolution. One reason for this is that when the input of a check vertex is mixture of symmetric Gaussians, due to the computation at the check vertex, its output is dominated by the Gaussian in the mixture having smallest mean.) After ﬁnding the appropriate degree distribution based on the single Gaussian assumption we use this degree distribution to ﬁnd the correct elementary EXIT charts based on a Gaussian mixture. Now we use the corrected curves to design an irregular code. In this level of design, the designed degree distribution is close to the degree distribution used in ﬁnding the elementary EXIT charts. Therefore, analyzing this code with its actual degree distribution shows minor error. One can continue these recursions for higher accuracy. However, in our examples after one iteration of design the designed threshold and the exact threshold diﬀered less than 0.01 dB.
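The optimization itself can be illustrated in miniature. The toy sketch below is our own: it uses the closed-form erasure-channel elementary charts fi(pin, p0) = p0(1 − (1 − pin)^(dc−1))^(i−1) rather than the semi-Gaussian curves of the text, restricts the degrees to a hypothetical set {2, 3, 10}, and replaces the linear-program solver with a coarse grid search over degree distributions; a real design would use an LP solver over many more degrees:

```python
def elementary_bec(p_in, p0, i, dc=6):
    """Elementary EXIT chart of a degree-i equality vertex on the erasure channel."""
    return p0 * (1.0 - (1.0 - p_in) ** (dc - 1)) ** (i - 1)

def chart_open(lambdas, p0, dc=6, grid=400):
    """LP constraint: the mixture chart stays below the 45-degree line on (0, p0]."""
    for k in range(1, grid + 1):
        x = p0 * k / grid
        if sum(l * elementary_bec(x, p0, i, dc) for i, l in lambdas.items()) >= x:
            return False
    return True

def design(p0, step=0.02, degrees=(2, 3, 10)):
    """Maximize sum(lambda_i / i) over a coarse grid of degree distributions."""
    best, best_obj = None, -1.0
    n = round(1.0 / step)
    for a in range(n + 1):
        for b in range(n + 1 - a):
            lam = {degrees[0]: a * step,
                   degrees[1]: b * step,
                   degrees[2]: 1.0 - (a + b) * step}
            obj = sum(l / i for i, l in lam.items())
            if obj > best_obj and chart_open(lam, p0):
                best, best_obj = lam, obj
    return best, best_obj

best, rate_term = design(0.35)
```

As in the text, maximizing Σ λi/i for a fixed check degree distribution maximizes the rate lower bound; the open-chart constraint is what limits the weight on low degrees.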


We have designed a number of irregular codes with a variety of rates using this Gaussian approximation. The results are presented in Table 10.1. In the design of the presented codes, we have avoided any equality vertices or check vertices with degrees higher than 40. Table 10.1 suggests that for code rates more than 0.25, the method is quite successful. For rates greater than 0.85, getting close to capacity requires high-degree check vertices. To show that our method can actually handle high-rate codes, we designed a rate 0.9497 code, which uses check vertices of degree 120 but no equality vertices of degree greater than 40. The degree sequence for this code is λ = {λ2 = 0.1029, λ3 = 0.1823, λ6 = 0.1697, λ7 = 0.0008, λ9 = 0.1094, λ15 = 0.0240, λ35 = 0.2576, λ40 = 0.1533}. The threshold of this code is at σ = 0.4462, which means it has a gap of only 0.0340 dB from the Shannon limit.

10.6 Conclusions and Future Prospects

Low-density parity-check (LDPC) codes are a flexible family of codes with a simple decoding algorithm. As we have shown in this chapter, the structure of the codes can be fine-tuned to allow for decoding even at channel parameters that approach the Shannon limit. Because of their excellent properties, LDPC codes have attracted enormous research interest, and there is now a large body of literature which we have not attempted to survey in this chapter. Low-density parity-check codes are now beginning to emerge in a variety of communication standards, including, e.g., the DVB-S2 standard for digital video broadcasting by satellite (Eroz et al., 2004). An important research direction is that of finding LDPC codes with relatively short block lengths (a few thousand bits, say) which still have excellent iterative decoding performance. For further reading in the area of LDPC codes, we recommend the books of Lin and Costello (2004) and MacKay (2003) and the survey article of Richardson and Urbanke (2003) as excellent starting points.

11 Turbo Processing

Claude Berrou, Charlotte Langlais, and Fabrice Seguin

Turbo processing is the way to process data in communication receivers so that no information stemming from the channel is wasted. The ﬁrst application of the turbo principle was in error correction coding, which is an essential function in modern telecommunications systems. A novel structure of concatenated codes, nicknamed turbo codes, was devised in the early 1990s in order to beneﬁt from the turbo principle. Turbo codes, which have near-optimal performance according to the theoretical limits calculated by Shannon, have since been adopted in several telecommunications standards. The turbo principle, also called the message-passing principle or belief propagation, is exploitable in signal processing other than error correction, such as detection and equalization. More generally, every time separate processors work on data sets that have some link together, the turbo principle may improve the result of the global processing. In digital circuits, the turbo technique is based on an iterative procedure, with multiple repeated operations in all the processors considered. Another more natural possibility is the use of analog circuits, in which the exchange of information between the diﬀerent processors is continuous.

11.1 Introduction

Error correction coding, also known as channel coding, is a fundamental function in modern telecommunications systems. Its purpose is to make these systems work even in tough physical conditions, due for instance to a low received signal level, interference, or fading. Another important field of application for error correction coding is mass storage (computer hard disks, CDs and DVD-ROMs, etc.), where the ever-continuing miniaturization of the elementary storage pattern makes reading the information more and more tricky. Error correction is a digital technique; that is, the information message to protect is composed of a certain number of digits drawn from a finite alphabet. Most often, this alphabet is binary, with logical elements or bits 0 or 1. Then, error


correction coding, in the so-called systematic way, involves adding some number of redundant logical elements to the original message, the whole being called a codeword. The mathematical law that is used to calculate the redundant part of the codeword is specific to a given code. Besides this mathematical law, the main parameters of a code are as follows.

- The code rate: the ratio between the number of bits in the original message and in the codeword. Depending on the application, the code rate may be as low as 1/6 or as high as 9/10.
- The minimum Hamming distance (MHD): the minimum number of bits that differ from one codeword to any other. The higher the MHD, the more robust the associated decoder when confronted with multiple errors.
- The ability of the decoder to exploit soft (analog) values from the demodulator, instead of hard (binary) values. A soft value (that is, the sign and the magnitude) carries more information than a hard value (only the sign).
- The complexity and the latency of the decoder.

Since the seminal work by Shannon on the potential of channel coding (Shannon, 1948), many codes have been devised and used in practical systems. The state of the art, in the early 1990s, was the coding construction depicted in fig. 11.1. This is called "standard concatenation" and is made up of a serial combination of a Reed-Solomon (RS) encoder, a symbol interleaver, and a convolutional encoder. The corresponding decoder (fig. 11.1) is composed of a Viterbi decoder, a symbol de-interleaver, and an RS decoder. This concatenated scheme works nicely because the Viterbi decoder can easily benefit from soft samples coming from the demodulator, while the RS decoder can withstand residual bursty errors that may come from the Viterbi decoder. Nevertheless, although the MHD of the concatenated code is very large, the decoder does not provide optimal error correction. Roughly, the performance is 3 or 4 dB from the theoretical limit. Where does this loss come from?

[Figure 11.1 here: (a) the transmitter chain "data to encode → Reed-Solomon encoder → interleaver → convolutional encoder → channel"; (b) the receiver chain "convolutional (Viterbi) decoder → de-interleaver → Reed-Solomon decoder → decoded data".]

Figure 11.1  Standard concatenation of a Reed-Solomon encoder and a convolutional encoder, and the associated decoder.


The inner Viterbi decoder, processing analog-like input samples, is locally optimum, that is, it derives the maximum beneﬁt from the redundancy added by the convolutional encoder. The outer RS decoder, which is also locally optimum, beneﬁts from the work of the inner decoder and from the redundancy added by the RS encoder. Both decoders are optimum, each of them separately, but their association is not optimal: the Viterbi decoder does not exploit the redundancy oﬀered by the RS codeword. A global decoder, which would use the whole redundancy in one processing step, would be extremely complex and is not realistic. The way to contemplate near-optimal decoding of the standard concatenated scheme is to enable the inner decoder to beneﬁt from the work done by the outer decoder, using a kind of feedback. This observation is at the root of turbo processing, “turbo” being used to refer to the way the power of a turbo engine is increased by the reuse of its exhaust gases. This being said, the concatenated decoding scheme of ﬁgure 11.1 does not easily lend itself to such a feedback principle. The code has to be devised in order to enable bidirectional exchanges between the two component decoders.

11.2 Random Coding and Recursive Systematic Convolutional (RSC) Codes

The theoretical limits were calculated by Shannon on the basis of random coding, which has since remained the reference in the matter of error correction. The systematic random encoding of a message having k information bits and producing a codeword with n bits may be achieved in the following way. As a first step, once and for all, k binary words of n − k bits are drawn at random and memorized. These k words will constitute the basis of a vector space, the ith random word (1 ≤ i ≤ k) being associated with the information message containing only zeros (the "all-zero" message) except in the ith place. The redundant part of any codeword is obtained by calculating the modulo-two sum of the random words whose address i is such that the ith bit of the original message is one. The coding rate is R = k/n. This very simple construction leads to a very large MHD. Because two codewords differ in at least one information bit, and thanks to the random feature of the redundant part, the mean distance is 1 + (n − k)/2. Nevertheless, the MHD of the code being a random value, its different realizations may be less than this mean value. A realistic approximation of the actual MHD is (n − k)/4. Such large values, for instance 100 for n = 2k = 800, are unreachable when using practical codes. Fortunately, these large MHDs are not necessary for common communications systems (Berrou et al., 2003). The device depicted in fig. 11.2 is called a recursive systematic convolutional (RSC) encoder, whose length, the number of memory elements, is denoted ν. This encoder is based on the principle and the random features of the linear feedback register (LFR), also called a pseudo-random generator. When an appropriate set of feedback taps is chosen, the period P of the LFR is maximum and equal


to P = 2^ν − 1. For sufficiently large values of ν, P can be larger than any message length to process, by several orders of magnitude. The RSC code can then be assimilated to a quasi-perfect random code. Values of ν larger than 30 or 40 would be sufficient to make the RSC code equivalent to a random code. The message d = {d0, ..., di, ..., dk−1} to be encoded feeds the LFR input and is transmitted as symbols X, the systematic part of the codeword. The redundant or parity part is provided by the modulo-2 summation of certain binary values from the register. Using the D (delay) formalism, the redundant symbols Y are expressed as

Y(D) = [G2(D)/G1(D)] d(D),   (11.1)

where

G1(D) = 1 + Σ_{j=1}^{ν−1} G1^(j) D^j + D^ν  and  G2(D) = 1 + Σ_{j=1}^{ν−1} G2^(j) D^j + D^ν   (11.2)

are the polynomials defining the taps for recursivity and parity construction. G1^(j) (resp. G2^(j)) is equal to 1 if the register tap at level j (1 ≤ j ≤ ν − 1) is used in the construction of recursivity (resp. parity), and 0 otherwise. G1(D) and G2(D) are generally given in octal form. For instance, 1 + D^3 + D^4 is referred to as polynomial 23. Convolutional encoding exhibits a side effect at the end of the encoding process, which may be detrimental to decoding performance for the last bits of the message. In order to take its decision, the decoder uses information carried by current, past, and subsequent symbols, and the subsequent symbols do not exist at the end of the block. This is known as the termination problem. Among the several solutions to this problem, the classical one consists in adding

Figure 11.2  Recursive systematic convolutional (RSC) encoder, with length ν.

dummy information bits, called tail bits, which make the encoder return to the "all-zero" state. Another, more elegant technique is tail-biting, also called circular termination (Weiss et al., 2001). This involves allowing any state as the initial state and encoding the sequence of k information bits so that the final state of the encoder register equals the initial state. The trellis of the code (the temporal representation of the possible states of the encoder, from time i = 0 to i = k − 1) can then be regarded as a circle. In what follows, we will refer to circular recursive systematic convolutional (CRSC) codes, the circular version of RSC codes. Thus, without having to pay for any additional information, and therefore without impairing spectral efficiency, the convolutional code has become a real block code, in which, for each time i, the past is also the future, and vice versa.

RSC and CRSC codes, like classical nonrecursive convolutional codes, are linear codes. Thanks to the linearity property, the code characteristics can be expressed with respect to the all-zero sequence. Any nonzero sequence d(D), accompanied by its redundancy Y(D), then represents a possible error pattern for the coding/decoding system, a 1 meaning a binary error. Equation 11.1 indicates that only a fraction of sequences d(D), those that are multiples of G1(D), lead to short-length redundancy. We call these particular sequences return-to-zero (RTZ) sequences (Podemski et al., 1995), because they force the encoder, if initialized in state 0, to return to this state after the encoding of d(D). In what follows, we will be interested only in RTZ patterns, assuming that the decoder will never decide in favor of a sequence whose distance from the all-zero sequence is very large. The fraction of sequences d(D) that are RTZ is exactly

p(RTZ) = 2^−ν,   (11.3)

because the encoder has 2^ν possible states and an RTZ sequence systematically finishes at state 0. The shortest RTZ sequence is G1(D) or a shifted version of it. Any RTZ sequence, in a block of k bits with circular termination, may be expressed as

RTZ(D) = G1(D) Σ_{i=0}^{k−1} ai D^i   mod (1 + D^k),   (11.4)
where ai takes the value 0 or 1. The operation modulo (1 + D^k) transforms every monomial D^x in the resulting product into D^(x mod k), for any integer x, so that all exponents lie between 0 and k − 1. The minimum number of 1's belonging to an RTZ sequence is two. This is because G1(D) is a polynomial with at least two nonzero terms, and equation 11.4 then guarantees that RTZ(D) also has at least two nonzero terms. The number of 1's in a particular RTZ sequence is called the input weight and is denoted w. We then have wmin = 2 for RSC codes, and the RTZ sequences with weight 2 are of the general form

RTZ_{w=2}(D) = D^τ (1 + D^{pP})   mod (1 + D^k),   (11.5)

where τ is the starting time, p any positive integer, and P the period of the encoder, as previously introduced. RTZ sequences with odd weight may or may not exist, depending on the expression of G1(D). RTZ sequences with even weight always exist, in particular of the form

RTZ_{w=2l}(D) = Σ_{j=1}^{l} D^{τj} (1 + D^{pj P})   mod (1 + D^k),   (11.6)

that is, as a combination of any l weight-2 RTZ sequences, with τj and pj any positive integers. This sort of composite RTZ sequence has to be considered closely when trying to design good permutations for turbo codes, as explained in section 11.5. What we are searching for is a very long RSC code, having a large period P, in order to take advantage of quasi-perfect random properties. But such codes cannot be decoded, because of the too-large number of states to consider and process. That is why other, somewhat more sophisticated forms of random-like codes have to be devised.
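As a quick sanity check on equation 11.3, the RTZ fraction of an RSC code can be estimated by simulation. The sketch below is illustrative code, not from the chapter; it assumes the feedback polynomial G1(D) = 1 + D + D^3 (ν = 3, the polynomial used in figure 11.6), for which the fraction of random input sequences that bring the encoder back to state 0 should be close to 2^−3 = 1/8.

```python
import random

# A sequence is RTZ iff the encoder, started in state 0, ends in state 0.
# Recursion from G1(D) = 1 + D + D^3:  a_i = d_i ^ a_{i-1} ^ a_{i-3}.

NU = 3

def final_state(bits):
    """Run the RSC recursion and return the final register contents."""
    reg = [0] * NU                    # reg = [a_{i-1}, a_{i-2}, a_{i-3}]
    for d in bits:
        a = d ^ reg[0] ^ reg[2]
        reg = [a] + reg[:-1]
    return tuple(reg)

def is_rtz(bits):
    return final_state(bits) == (0,) * NU

random.seed(1)
k, trials = 64, 20000
rtz = sum(is_rtz([random.randint(0, 1) for _ in range(k)])
          for _ in range(trials))
frac = rtz / trials
print(f"observed p(RTZ) = {frac:.3f}  (theory: {2 ** -NU})")
```

The observed fraction matches 2^−ν to within sampling noise, in line with equation 11.3.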

11.3  Turbo Codes

In the previous section, we saw that the probability that any given sequence is an RTZ sequence for a CRSC encoder is 1/2^ν. Now, if we encode this sequence N times (fig. 11.3a, with ν = 3), each time in a different order drawn at random by permutation Πj (1 ≤ j ≤ N; the first order may be the natural order), the probability that the sequence remains RTZ for all encoders is lowered to 1/2^{Nν}. For example, with ν = 3 and N = 7, this probability is less than 10^−6. This technique is known as multiple parallel concatenation of CRSC codes (Berrou et al., 1999). Of course, to deal with realistic coding rates (around 1/2), some puncturing has to be performed, that is, not all the redundant symbols Yj are used to form the codeword. For instance, if R = 1/2, each component encoder provides only k/N parity bits. If the message is not RTZ after permutation Πj, the average weight of sequence {Yj} is k/(2N) (assuming that, statistically, every other bit in {Yj} is 1). This guarantees a large distance whenever at least one permuted sequence is not RTZ. Fortunately, it is possible to obtain quasi-optimum performance with only two encodings (fig. 11.3b), and this is the classical turbo code (Berrou et al., 1993). For bit error rates (BERs) higher than around 10^−5, the permutation may still be drawn at random but, for lower error rates, a particular effort has to be made in its design. The way the permutation is devised fixes the MHD dmin of the turbo code, and therefore the achievable asymptotic gain Ga offered by the coding scheme, according to the

Figure 11.3  (a) In this multiple parallel concatenation of circular recursive systematic convolutional (CRSC) codes, the block containing k information bits is encoded N times. The probability that the sequence remains of the return-to-zero (RTZ) type after the N permutations, drawn at random (except the first one), is very low. The properties of this multiconcatenated code are very close to those of random codes. (b) The number of encodings can be limited to two, provided that permutation Π is judiciously devised. This is a classical turbo code.

well-known approximation

Ga ≈ 10 log10(R dmin).   (11.7)
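Plugging numbers into this approximation makes its meaning concrete (the logarithm is taken in base 10, as usual for gains in dB):

```python
import math

# Asymptotic gain Ga ≈ 10·log10(R·dmin) (eq. 11.7), evaluated for an
# illustrative rate-1/2 code with minimum Hamming distance 8.

def asymptotic_gain_db(rate, d_min):
    return 10 * math.log10(rate * d_min)

ga = asymptotic_gain_db(0.5, 8)
print(f"Ga = {ga:.2f} dB")   # 10·log10(4) ≈ 6.02 dB
```

Doubling dmin at a fixed rate adds about 3 dB of asymptotic gain, which is why the permutation design (which fixes dmin) matters so much.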

The natural coding rate of a turbo code is R = 1/3. In order to obtain higher rates, certain redundant symbols are punctured. For instance, Y1 and Y2 symbols are transmitted alternately to achieve R = 1/2. A particular turbo code is defined by the following parameters:

- m, the number of bits in the input words. Applications known so far consider binary (m = 1) and double-binary (m = 2) input words (see section 11.6).
- The component codes C1 and C2 (code memory ν, recursivity and redundancy polynomials). The values of ν are 3 or 4 in practice, and the polynomials are generally those recognized as the best for simple unidimensional convolutional coding, that is, (15, 13) for ν = 3 and (23, 35) for ν = 4, or their symmetric forms.
- The permutation function, which plays a decisive role when the target BER is lower than about 10^−5. Above this value, the permutation may follow any law, provided of course that it respects at least the scattering property (the permutation may be the regular one, for instance).
- The puncturing pattern. This has to be as regular as possible, as for simple convolutional codes. In addition to this rule, the puncturing pattern is defined in close relationship with the permutation function when very low error rates are sought.
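The alternate puncturing mentioned above can be sketched in a few lines (illustrative code; the bit streams are arbitrary example values, not an actual encoder output):

```python
# From the natural rate-1/3 output (X, Y1, Y2), transmitting Y1 and Y2
# alternately yields rate 1/2: each information bit is accompanied by
# exactly one parity bit.

def puncture_to_rate_half(x, y1, y2):
    out = []
    for i in range(len(x)):
        out.append(x[i])                             # systematic bit, always sent
        out.append(y1[i] if i % 2 == 0 else y2[i])   # alternate parities
    return out

k = 8
x  = [1, 0, 1, 1, 0, 0, 1, 0]
y1 = [0, 1, 1, 0, 1, 1, 0, 0]
y2 = [1, 1, 0, 0, 0, 1, 1, 1]
word = puncture_to_rate_half(x, y1, y2)
print(len(word), "coded bits for", k, "information bits")   # 16 for 8 -> R = 1/2
```

At the receiver, the punctured parity positions are fed to the decoder as neutral (zero-LLR) values, as described in section 11.4.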

11.4  Turbo Decoding

Decoding a composite code by a single global process is not possible in practice, because of the tremendous number of states to consider. A joint probabilistic process by the decoders of C1 and C2 has to be elaborated, following a kind of divide-and-conquer strategy. Because of latency constraints, this joint process is worked out in an iterative manner in a digital circuit. Analog versions of the turbo decoder are also being considered, offering several advantages, as explained in section 11.7.

Turbo decoding relies on the following fundamental criterion, which is applicable to all so-called message-passing or belief-propagation algorithms (McEliece et al., 1998): when several probabilistic machines work together on the estimation of a common set of symbols, all the machines have to give the same decision, with the same probability, about each symbol, as a single (global) decoder would.

To make the composite decoder satisfy this criterion, the structure of fig. 11.4 is adopted. The double loop enables both component decoders to benefit from the whole redundancy. The components are soft-in/soft-out (SISO) decoders and permutation (Π) and inverse permutation (Π^−1) memories. The node variables of the decoder are logarithms of likelihood ratios (LLRs), also simply called log-likelihood ratios. The LLR related to a particular binary datum di (0 ≤ i ≤ k − 1) is defined, apart from a multiplying factor, as

LLR(di) = ln [Pr(di = 1) / Pr(di = 0)].   (11.8)

The role of a SISO decoder is to process an input LLR and, thanks to local redundancy (i.e., y1 for DEC1, y2 for DEC2), to try to improve it. The output LLR of a SISO decoder, for a binary datum, may simply be written as

LLRout(di) = LLRin(di) + z(di),   (11.9)

where z(di ) is the extrinsic information about di , provided by the decoder. If this

Figure 11.4  An 8-state turbo code (a) and its associated decoder (b), with a basic structure assuming no processing delay.

works properly, z(di) is most of the time negative if di = 0, and positive if di = 1. The composite decoder is constructed in such a way that only extrinsic terms are passed from one component decoder to the other. The input LLR to a particular decoder is composed of the sum of two terms: the information symbols (x) stemming from the channel, also called the intrinsic values, and the extrinsic terms (z) provided by the other decoder, which serve as a priori pieces of information. The intrinsic symbols are inputs common to both decoders, which is why the extrinsic information does not contain them. In addition, the outgoing extrinsic information does not include the incoming extrinsic information, in order to minimize correlation effects in the loop. The subtractors in fig. 11.4 are used to remove intrinsic and extrinsic information from the feedback loops. Nevertheless, because the blocks have finite length, correlation effects between extrinsic and intrinsic values may exist and degrade the decoding performance.

The practical course of operation is the following.

Step 1: process the data peculiar to one code, say C2 (x and y2), with decoder DEC2, and store the extrinsic pieces of information (z2) resulting from the decoding in a memory. If data are missing because of puncturing, the corresponding values are set to analog 0 (the neutral value).

Step 2: process the data specific to C1 (x, deinterleaved z2, and y1) with decoder DEC1, and store the extrinsic pieces of information (z1) in a memory. By properly organizing the read/write instructions, the same memory can be used for storing both z1 and z2. Steps 1 and 2 make up the first iteration.

Step 3: process C2 again, now taking interleaved z1 into account, and store the updated values of z2. And so on.
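The schedule above can be illustrated with a toy iterative decoder. The sketch below is not the chapter's CRSC scheme: it replaces the two convolutional SISO decoders with SISO decoders of single-parity-check codes acting on the rows and columns of a 2 × 2 data block (a minimal product-code stand-in). The message passing, however, follows figure 11.4: each half-iteration takes intrinsic x plus the other decoder's extrinsic z as input, and produces only its own extrinsic z as output.

```python
import math

# LLR > 0 means "bit = 1", as in eq. 11.8.

def llr_of_xor(llrs):
    """LLR of the XOR of independent bits, given their LLRs."""
    p = (-1) ** (len(llrs) - 1)
    for l in llrs:
        p *= math.tanh(l / 2)
    return 2 * math.atanh(max(min(p, 0.999999), -0.999999))

def spc_extrinsic(llrs):
    """Extrinsic LLRs for a single-parity-check codeword (even parity):
    each bit equals the XOR of all the others."""
    return [llr_of_xor(llrs[:i] + llrs[i + 1:]) for i in range(len(llrs))]

# True data 1,0 / 0,1 with row parities (1, 1) and column parities (1, 1).
# Channel LLRs have magnitude 2, except d[0][0], received with the wrong
# sign (a transmission error).
x_d = [[-1.0, -2.0], [-2.0, 2.0]]
x_r = [2.0, 2.0]
x_c = [2.0, 2.0]

z_row = [[0.0, 0.0], [0.0, 0.0]]
z_col = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(2):                   # two iterations
    for i in range(2):               # row half-iteration ("DEC1")
        inp = [x_d[i][0] + z_col[i][0], x_d[i][1] + z_col[i][1], x_r[i]]
        z_row[i][0], z_row[i][1], _ = spc_extrinsic(inp)
    for j in range(2):               # column half-iteration ("DEC2")
        inp = [x_d[0][j] + z_row[0][j], x_d[1][j] + z_row[1][j], x_c[j]]
        z_col[0][j], z_col[1][j], _ = spc_extrinsic(inp)

llr_out = [[x_d[i][j] + z_row[i][j] + z_col[i][j] for j in range(2)]
           for i in range(2)]
decisions = [[int(l > 0) for l in row] for row in llr_out]
print(decisions)   # -> [[1, 0], [0, 1]]: the flipped bit is recovered
```

The final decision LLR for each datum is x + z1 + z2, exactly the quantity LLRout of figure 11.4, and the bit hit by the channel error is corrected after the first iteration.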


The process ends after a preestablished number of iterations, or after the decoded block has been estimated as correct according to some stopping criterion (see Matache et al., 2000, for possible stopping rules). The typical number of iterations for the decoding of convolutional turbo codes is four to ten, depending on the constraints relating to complexity, power consumption, and latency. According to the structure of the decoder, after p iterations, the output of DEC1 is

LLRout1,p(di) = (x + z2,p−1(di)) + z1,p(di),

where zu,p(di) is the extrinsic piece of information about di yielded by decoder u after iteration p, and the output of DEC2 is

LLRout2,p(di) = (x + z1,p−1(di)) + z2,p(di).

If the iterative process converges toward fixed points, z1,p(di) − z1,p−1(di) and z2,p(di) − z2,p−1(di) both tend to zero as p goes to infinity. Therefore, from the equations above, both LLRs become equal, which fulfills the fundamental condition of equal probabilities provided by the component decoders for each datum di. As for the proof of convergence itself, one can refer to various papers dealing with the theoretical aspects of the subject, such as Weiss and Freeman (2001) and Duan and Rimoldi (2001). An important tool for the analysis of convergence is the EXIT chart (ten Brink, 2001). EXIT, which stands for extrinsic information transfer, treats each SISO decoder in the turbo decoder as a nonlinear transfer function of extrinsic information, in a statistical way.

Turbo decoding is not optimal. This is because an iterative process obviously has to begin, during the first half-iteration, with only part of the redundant information available (either y1 or y2). Furthermore, correlation effects between the noises affecting intrinsic and extrinsic terms may be detrimental. Fortunately, the loss due to suboptimality is small, a few tenths of a dB.
There are two families of SISO algorithms: those based on the Viterbi algorithm (Battail, 1987; Hagenauer and Hoeher, 1989), which can be used for high-throughput continuous-stream applications, and those based on the APP (a posteriori probability, also called MAP or BCJR) algorithm (Bahl et al., 1974) or its simplified derived versions (Robertson et al., 1997), for block decoding. If the full APP algorithm is chosen, it is better for extrinsic information to be expressed by probabilities instead of LLRs, which avoids calculating a useless variance for the extrinsic terms. In practice, depending on the kind of SISO algorithm chosen, some tuning operations (multiplying, limiting) on extrinsic information are added to the basic structure to ensure stability and convergence within a small number of iterations.
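The simplified APP variants mentioned above revolve around the Jacobian "max*" operator; the sketch below (a generic identity, not code from the chapter) shows the correction term that Max-Log-APP simply drops:

```python
import math

# ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|): exact "max*" used by
# Log-APP; Max-Log-APP keeps only the max(a, b) part.

def max_star(a, b):
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

a, b = 1.2, 0.9
exact = math.log(math.exp(a) + math.exp(b))
print(f"exact = {exact:.4f}, max* = {max_star(a, b):.4f}, max = {max(a, b):.4f}")
```

The dropped correction is at most ln 2 ≈ 0.69 (when a = b) and vanishes as |a − b| grows, which is why Max-Log-APP loses only a fraction of a dB while avoiding exponentials in hardware.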

11.5  Permutation

In a turbo code, permutation plays a double role:

1. It must ensure maximal scattering or spreading of adjacent bits, in order to minimize the correlation effects in the message passing between the two component decoders.
2. It contributes greatly to the value of the MHD. Between a badly designed and a well-designed permutation, the MHD may differ by a factor of 2 or 3.

Let us consider the binary turbo code represented in fig. 11.4, with permutation acting on k bits. The worst permutation we can imagine is the identity permutation, which minimizes the coding diversity (i.e., Y1 = Y2). On the other hand, the best permutation that could be used, but which probably does not exist (Svirid, 1995), would allow the concatenated code to be equivalent to a sequential machine whose irreducible number of states is 2^{k+6}. There are actually k + 6 binary storage elements in the structure, k in the permutation memory and 6 in the encoders. Assimilating this machine to a convolutional code would give a very long code and very large minimum distances, for usual values of k. Between the worst and the best of permutations, there is a great choice among the k! possible combinations, and we still lack a sound theory on the matter. Nevertheless, good permutations have already been designed, using pragmatic approaches, to elaborate normalized turbo codes.

11.5.1  Regular Permutation

Maximum spreading (criterion 1 above) is achieved by regular permutation. For a long time, regular permutation was almost exclusively seen as rectangular (linewise writing and columnwise reading in an ad hoc memory, fig. 11.5). When CRSC codes are used as the component codes of a turbo code, circular permutation, based on congruence properties, is more appropriate. Circular permutation, for blocks having k information bits (fig. 11.5), is devised as follows. After the data are written in a linear memory, with address i (0 ≤ i ≤ k − 1), the block is likened to a circle, both extremities of the block (i = 0 and i = k − 1) then being contiguous. The data are read out such that the jth datum read was written at the position i given by

i = Π(j) = Pj   mod k,

where the skip value P is an integer relatively prime with k. We define the total spatial distance (or span) S(j1, j2) as the sum of the two spatial distances, before and after permutation, for a given pair of positions j1 and j2:

S(j1, j2) = f(j1, j2) + f(Π(j1), Π(j2)),   (11.10)

Figure 11.5  Rectangular (a) and circular (b) permutation. In the rectangular case, the block of k = M·N bits is written along the M rows and read along the N columns.

where

f(u, v) = min{|u − v|, k − |u − v|}.   (11.11)

Finally, we denote by Smin the minimum value of S(j1, j2) over all possible pairs j1 and j2:

Smin = min_{j1, j2} S(j1, j2).   (11.12)

With regular permutation, the value of P that maximizes Smin (Boutillon and Gnaedig, 2005) is

P0 = √(2k),   (11.13)

with the condition

k = P0/2   mod P0,   (11.14)

which gives

Smin = P0 = √(2k).   (11.15)
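Equations 11.10 to 11.15 can be checked numerically by brute force. In the sketch below, k = 72 and P = 11 are arbitrary illustrative choices (P is prime with k and close to P0 = √(2·72) = 12):

```python
import math

# Minimum span Smin (eq. 11.12) of the circular permutation i = Pj mod k,
# computed over all pairs of positions, and compared with sqrt(2k).

k, P = 72, 11

def f(u, v):
    d = abs(u - v)
    return min(d, k - d)          # circular spatial distance (eq. 11.11)

def perm(j):
    return (P * j) % k            # circular (congruential) permutation

smin = min(f(j1, j2) + f(perm(j1), perm(j2))
           for j1 in range(k) for j2 in range(j1 + 1, k))
print(smin, math.sqrt(2 * k))     # 12  12.0
```

For this (k, P) pair the brute-force minimum span reaches the bound √(2k) of equation 11.15 exactly.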

In practice, to comply as far as possible with the criterion of maximum total spatial distance, P is chosen as an integer close to P0 and prime with k.

11.5.2  Real Permutations

Let us recall that a decoder of an RSC code is only sensitive to error sequences of the RTZ type, as introduced in section 11.2. Real permutations then have to satisfy, as closely as possible, the following ideal rule: if a sequence is RTZ before permutation, then it is not RTZ after permutation, and vice versa.


In this case, at least one of the component decoders in fig. 11.4 is able to recover from the errors. But the previous rule is impossible to comply with, and a more realistic target is this one: if a sequence is a short RTZ sequence before permutation, then either it is not RTZ or it is a long RTZ sequence after permutation, and vice versa. The dilemma in the design of a good permutation lies in the need to satisfy this practical rule for two distinct classes of codewords, which require conflicting treatment. The first class contains all nonzero codewords (again with reference to the "all-zero" codeword) that are not combinations of simple RTZ sequences; a good permutation for this class is as regular as possible, which ensures maximum spreading. This type of sequence has low input weight (w ≤ 3).

Figure 11.6  Some possible RTZ (return-to-zero) sequences for both encoders C1 and C2, with G1(D) = 1 + D + D^3 (period L = 7). (a) With input weight w = 2; (b) with w = 3; (c) with w = 6 or 9.
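The weight-2 patterns of figure 11.6a can be reproduced in a few lines of illustrative code (not from the chapter): with the feedback polynomial G1(D) = 1 + D + D^3, period 7, a weight-2 input is RTZ exactly when the spacing between its two 1's is a multiple of 7.

```python
# RSC recursion for G1(D) = 1 + D + D^3:  a_i = d_i ^ a_{i-1} ^ a_{i-3}.
# An input is RTZ iff the register returns to (0, 0, 0).

def final_state(bits):
    reg = [0, 0, 0]                  # a_{i-1}, a_{i-2}, a_{i-3}
    for d in bits:
        a = d ^ reg[0] ^ reg[2]
        reg = [a, reg[0], reg[1]]
    return tuple(reg)

def weight2_input(k, tau, spacing):
    bits = [0] * k
    bits[tau] = bits[tau + spacing] = 1
    return bits

print(final_state(weight2_input(30, 2, 7)))    # spacing 7  -> (0, 0, 0): RTZ
print(final_state(weight2_input(30, 2, 14)))   # spacing 14 -> (0, 0, 0): RTZ
print(final_state(weight2_input(30, 2, 5)))    # spacing 5  -> nonzero: not RTZ
```

This matches equation 11.5: between its two 1's, the encoder free-runs with period 7, so only spacings pP (here p·7) bring it back to state 0.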

The second class encompasses all codewords that are combinations of simple RTZ sequences, and nonuniformity (controlled disorder) has to be introduced into the permutation function to obtain a large MHD. Figure 11.6 illustrates the situation, showing the example of a rate-1/3 turbo code, using binary component encoders with code memory ν = 3 and periodicity L = 2^ν − 1 = 7. For the sake


of simplicity, the block of k bits is organized as a rectangle with M rows and N columns (M ≈ N ≈ √k). Regular permutation is used, that is, data are written linewise and read columnwise. Figure 11.6a depicts a situation where encoder C1 (the horizontal one) is fed by an RTZ sequence with input weight w = 2. Redundancy Y1 delivered by this encoder is poor, but redundancy Y2 produced by encoder C2 (the vertical one) is very informative for this pattern, which is also an RTZ sequence but whose span is 7M instead of 7. The associated MHD would be around 7M/2, which is a large value for typical sizes k. With respect to this w = 2 case, the code is said to be "good" because dmin tends to infinity when k tends to infinity.

Figure 11.6b deals with a weight-3 RTZ sequence. Again, whereas the contribution of redundancy Y1 is not high for this pattern, redundancy Y2 gives relevant information over a large span, of length 3M. The conclusions are the same as in the previous case.

Figure 11.6c shows two examples of sequences with weights w = 6 and w = 9, which are RTZ sequences for encoder C1 as well as for encoder C2. They are obtained by combining two or three minimal-length RTZ sequences. The weight of the redundant bits is limited and depends neither on M nor on N. These patterns are typical of the codewords that limit the MHD of a turbo code when a regular permutation is used. In order to "break" such rectangular patterns, some disorder has to be introduced into the permutation rule, while ensuring that the good properties of regular permutation with respect to low weights are not lost. This is the crucial problem in the search for good permutations, which has not yet found a definitive answer. Nevertheless, some good permutations have already been devised for recent applications, e.g., IMT-2000 (3GPP, 1999; TIA/EIA/IS, 1999) and DVB.
Further details about non-regular permutations can be found in Crozier and Guinand (2003) and Berrou et al. (2004).
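The spreading effect behind figures 11.6a and 11.6b can be checked directly (illustrative code; the block shape is an arbitrary choice): with a rectangular permutation, two bits 7 positions apart in a row end up 7M positions apart in the read-out order, turning a short horizontal RTZ pattern into a long vertical one.

```python
# Rectangular permutation on an M x N block: write row-wise, read
# column-wise, and track where a written position lands in read order.

M, N = 12, 12            # arbitrary block shape for the sketch (k = 144)

def read_order(i):
    """Position of written bit i in the column-wise read-out sequence."""
    row, col = divmod(i, N)
    return col * M + row

i = 3                     # two bits in the same row, 7 apart
d = abs(read_order(i + 7) - read_order(i))
print(d, 7 * M)           # 84  84
```

This is precisely why the w = 2 and w = 3 patterns are well protected by regular permutation, while the square-shaped composite patterns of figure 11.6c, aligned with both the rows and the columns, are not.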

11.6  Applications of Turbo Codes

Depending on the constraints imposed by the application (performance, throughput, latency, complexity, etc.), error correction codes can be divided into many families. We will consider here three domains, related to error rates.

Medium error rates (corresponding roughly to 10^−2 > BER > 10^−6 or 1 > FER > 10^−4): This is typically the domain of automatic repeat request (ARQ) systems and is also the most favorable range of error rates for turbo codes. To achieve near-optimum performance, eight-state component codes are sufficient. Figure 11.7 depicts the practical binary turbo code used for these applications and coding rates equal to or lower than 1/2. For higher rates, the double-binary turbo code of figure

Figure 11.7  The four turbo codes used in practice. (a) 8-state binary and (b) 8-state double-binary, both with polynomials 15, 13 (or their symmetric form 13, 15); (c) 16-state binary and (d) 16-state double-binary, both with polynomials 23, 35 (or their symmetric form 31, 27). Binary codes are suitable for rates lower than 1/2, double-binary codes for rates higher than 1/2.

11.7 is preferable (Berrou and Jézéquel, 1999). For each of them, one example of performance, in frame error rate (FER) as a function of signal-to-noise ratio Eb/N0, is given in fig. 11.8 (UMTS: R = 1/3, k = 640, and DVB-RCS: R = 2/3, k = 1504).

Low error rates (10^−6 > BER > 10^−11 or 10^−4 > FER > 10^−9): Sixteen-state turbo codes perform better than eight-state ones, by about 1 dB, for an FER of 10^−7 (see fig. 11.8). Depending on the sought-for compromise between performance and decoding complexity, one can choose either one or the other. Figures 11.7c and 11.7d depict the 16-state turbo codes that can be used, the binary one for low rates, the double-binary one for high rates. In order to obtain


Figure 11.8  Some examples of performance, expressed in FER as a function of Eb/N0 (dB), achievable with turbo codes on Gaussian channels: QPSK, 8-state binary, R = 1/3, 640 bits; QPSK, 8-state double-binary, R = 2/3, 1504 bits; QPSK, 16-state double-binary, R = 2/3, 1504 bits; 8-PSK, 16-state double-binary, R = 2/3, 1504 bits, pragmatic coded modulation. In all cases: decoding using the Max-Log-APP algorithm with 8 iterations and 4-bit input quantization.

good results at low error rates, the permutation function must be very carefully devised.

An example of the performance provided by the association of 8-PSK (phase-shift keying) modulation and the turbo code of fig. 11.7d is also plotted in fig. 11.8, for k = 1504 and a spectral efficiency of 2 bit/s/Hz. This association is made according to the pragmatic approach, that is, the codec is the same as the one used for binary modulation. It just requires binary-to-octal conversion at the transmitter side, and the converse at the receiver side.

Very low error rates (10^−11 > BER or 10^−9 > FER): The largest minimum distances that can be obtained from turbo codes, for the time being, are not sufficient to prevent a change of slope in the BER(Eb/N0) or FER(Eb/N0) curves at very low error rates. Compared to what is possible today, an increase of MHDs by roughly 25% would be necessary to make turbo codes attractive for this type of application, such as optical transmission or mass storage error protection.

Table 11.1 summarizes the normalized applications of convolutional turbo codes known to date. The first three codes of fig. 11.8 have been chosen for these various systems.
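The pragmatic binary-to-octal association mentioned above can be sketched as follows. This is illustrative code only: the turbo codec is untouched, coded bits are simply grouped three by three and placed on 8-PSK constellation points, and the Gray mapping used here is an arbitrary illustrative choice, not the one from the chapter.

```python
import cmath, math

GRAY = [0, 1, 3, 2, 6, 7, 5, 4]       # a 3-bit Gray sequence

def map_8psk(bits):
    """Group coded bits in triplets and map each onto a unit-energy phase
    (assumes len(bits) is a multiple of 3)."""
    symbols = []
    for i in range(0, len(bits), 3):
        idx = bits[i] * 4 + bits[i + 1] * 2 + bits[i + 2]
        phase = 2 * math.pi * GRAY.index(idx) / 8
        symbols.append(cmath.exp(1j * phase))
    return symbols

coded = [0, 0, 1, 1, 1, 0, 0, 1, 1]    # 9 coded bits -> 3 symbols
syms = map_8psk(coded)
print([round(abs(s), 6) for s in syms])  # unit modulus each
```

At the receiver, a soft demodulator converts each noisy symbol back into three bit LLRs, which the unchanged binary turbo decoder then processes.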

11.7  Analog Turbo Processing

Table 11.1  Current Known Applications of (Convolutional) Turbo Codes

Application                               | Turbo code             | Termination | Polynomials    | Rates
CCSDS (deep space)                        | binary, 16-state       | tail bits   | 23, 33, 25, 37 | 1/6, 1/4, 1/3, 1/2
UMTS, CDMA2000 (3G mobile)                | binary, 8-state        | tail bits   | 13, 15, 17     | 1/4, 1/3, 1/2
DVB-RCS (return channel over satellite)   | double-binary, 8-state | circular    | 15, 13         | 1/3 up to 6/7
DVB-RCT (return channel over terrestrial) | double-binary, 8-state | circular    | 15, 13         | 1/2, 3/4
Inmarsat (M4)                             | binary, 16-state       | no          | 23, 35         | 1/2
Eutelsat (Skyplex)                        | double-binary, 8-state | circular    | 15, 13         | 4/5, 6/7
IEEE 802.16 (WiMAX)                       | double-binary, 8-state | circular    | 15, 13         | 1/2 up to 7/8

The ever-improving performance of A/D and D/A converters, coupled with the achievements of Moore's law, has today resulted in fully digital processing: digital signal processors and other digital programmable devices have superseded the analog circuits traditionally used in telecommunications transceivers. The only blocks that have not yet been totally replaced by digital counterparts are front-end blocks such as amplifiers and oscillators. One may ask whether this tendency is the best suited to the design of some receiver functions, in particular error correction, which has to cope with a noise-corrupted analog signal. From fig. 11.9, which represents a generic digital transceiver, the following comments about the nature of the signal through the chain can be made. First, the data to send are processed by the channel encoder, which adds redundant bits in order to make these data more resilient to channel noise. The resulting digital signal is next modulated and up-converted using a high-frequency carrier. This results in an analog signal that is transmitted over the channel. As the signal propagates through the channel, it is altered by various noise sources, which are themselves analog (for example, weather or electromagnetic conditions, interference). On the receiver side, the corrupted data are amplified and down-converted to baseband. This is again analog processing. A crucial choice now has to be made: should the signal be digitized or should it remain analog? Should the receiver take a hard decision or a soft one? The loss of information due to quantization does not plead in favor of a digital solution. This explains the motivation of the laboratories that have proposed implementing channel decoders in analog form

324

Turbo Processing

(Hagenauer, 1997a; Lustenberger et al., 1999; Moerz et al., 2000). These studies have not only proved the validity of the concepts but have also shown signiﬁcant gains in terms of speed, power consumption, and silicon area over digital solutions (Gaudet and Gulak, 2003; Moerz et al., 2000).

Figure 11.9  Generic digital transceiver. In the receive chain, the low-noise amplifier, down-converter, and demodulator are analog, while channel decoding may be digital or analog.
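The hard-versus-soft choice can be made concrete with channel LLRs. The sketch below uses the standard expression LLR = 2y/σ² for BPSK over an AWGN channel (a textbook result, not derived in the chapter); a hard decision keeps only the sign of y and discards the reliability information that a SISO decoder exploits.

```python
# Channel LLRs for BPSK (bit 1 -> +1, bit 0 -> -1) over AWGN with noise
# variance sigma2: LLR(X) = 2y / sigma2, following the convention
# LLR > 0 <=> bit = 1.

def channel_llr(y, sigma2):
    return 2 * y / sigma2

sigma2 = 0.5
for y in (1.3, 0.1, -0.1):
    l = channel_llr(y, sigma2)
    hard = 1 if y > 0 else 0
    print(f"y = {y:+.1f}  LLR = {l:+.1f}  hard decision = {hard}")
# y = +1.3 and y = +0.1 both give hard decision 1, but with very different
# reliabilities (LLR 5.2 vs 0.4): the soft value is what a decoder can use.
```

An analog decoder works on such continuous values directly, avoiding the quantization step altogether.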

Historically, the analog implementation of error correction algorithms began with some research on the soft-output Viterbi algorithm (SOVA) (Battail, 1987; Hagenauer and Hoeher, 1989). Nevertheless, the Viterbi algorithm does not easily lend itself to analog implementation, for which the APP algorithm (Anderson and Hladick, 1998; Bahl et al., 1974) is preferred. It uses only sum/product operators and logarithm/exponential functions. The latter are necessary to convert probabilities into log-likelihood ratios (LLRs), as introduced in section 11.4, and vice versa. LLRs are available at the output of a soft demodulator such as that described in Seguin et al. (2004). If X is a binary random variable and x its observation, the LLR is defined by

LLR(X) = ln [Pr(X = 1|x) / Pr(X = 0|x)].   (11.16)

The exponential and natural logarithm functions are readily available from a bipolar junction transistor (BJT) biased in the forward active region. The collector current IC depends on the base-emitter voltage VBE according to the well-known relation

IC ≈ IS exp(VBE/UT),   (11.17)

where IS is the saturation current and UT the thermal voltage. When connected as a diode (fig. 11.10a), the transistor produces a voltage V between collector and emitter that depends on the current I as

V ≈ UT ln(I/IS).   (11.18)

Associating a current with a probability, and a voltage with an LLR, it is thus


Figure 11.10  Basic structures for analog decoders. (a) Diode-connected bipolar transistor. (b) Gilbert cell: with input voltages Vi1 = UT ln[Pr(X = 1|x)/Pr(X = 0|x)] and Vi2 = UT ln[Pr(Y = 1|y)/Pr(Y = 0|y)], the tail currents split as IC1 = Ibias Pr(Y = 1|y) and IC2 = Ibias Pr(Y = 0|y), and the four output collector currents are IC3 = Ibias Pr(Y = 1|y)Pr(X = 1|x), IC4 = Ibias Pr(Y = 1|y)Pr(X = 0|x), IC5 = Ibias Pr(Y = 0|y)Pr(X = 0|x), and IC6 = Ibias Pr(Y = 0|y)Pr(X = 1|x).

Associating a current with a probability and a voltage with an LLR, it is thus possible to convert LLRs into probabilities and probabilities into LLRs by using differential structures with transistors and diodes. Currents can also easily be added or multiplied in order to carry out the APP operations. The basic computing block of the decoder is the well-known analog multiplier (fig. 11.10b) called the Gilbert cell (Gilbert, 1968). This cell converts LLRs into probabilities and multiplies them by each other at the same time. The collector currents of transistors Q3, Q4, Q5, Q6, which represent the information, can also be summed at a given node of the circuit. By adding a diode to the collector of each transistor, the information can be reconverted into LLR form. Thus the APP algorithm can be implemented with a BJT-based network that directly maps the code trellis.
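This behavior can be sketched as an idealized (mismatch-free) model of the cell in fig. 11.10b; the value of U_T and the bias current are illustrative assumptions:

```python
import math

U_T = 0.025  # thermal voltage (V), illustrative

def split(v):
    """Ideal differential pair: the tail current splits between the two
    branches according to the logistic function of v = U_T * LLR."""
    p1 = 1.0 / (1.0 + math.exp(-v / U_T))
    return p1, 1.0 - p1

def gilbert_cell(llr_x, llr_y, i_bias=1e-6):
    """Behavioral model of the Gilbert cell: the collector currents of
    Q3..Q6 are I_bias times products of the two symbol probabilities,
    so the cell converts LLRs to probabilities and multiplies them."""
    px1, px0 = split(U_T * llr_x)
    py1, py0 = split(U_T * llr_y)
    return {
        "C3": i_bias * py1 * px1,
        "C4": i_bias * py1 * px0,
        "C5": i_bias * py0 * px0,
        "C6": i_bias * py0 * px1,
    }

currents = gilbert_cell(llr_x=1.0, llr_y=-0.5)
# The four output currents always sum to I_bias, since the joint
# probabilities Pr(Y|y)Pr(X|x) sum to one.
```

Summing selected collector currents at a node then realizes the additions required by the APP algorithm.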

11.7.1 Implementation of the APP Algorithm

The encoder, at time i (0 ≤ i ≤ k − 1), and the trellis section of an R = 1/2 four-state recursive systematic convolutional (RSC) code are depicted in ﬁg. 11.11. This trellis is assumed to be circular, that is, the states at the beginning and at the end of the encoding are equal. After receiving the LLRs stemming from the channel, the frame containing n = 2k symbols is decoded by feeding the decoder inputs in parallel and by letting the analog network converge toward a stable state. The topology of the circular decoder is given in ﬁg. 11.12. The on-chip network is the direct translation of the APP algorithm (Anderson and Hladick, 1998). It is divided into as many sections as the k information bits to decode. Each section is built from several modules: a Γ module to compute the branch metrics,


Figure 11.11 A 4-state RSC encoder with code rate R = 1/2. The input information symbol X is transmitted together with the redundant symbol (parity bit) Y. A trellis section is also shown, whose branches are labeled with the encoded symbols.

an A module to compute the forward metrics, a B module to compute the backward metrics, and a Dec module that takes a final hard decision on the value of the information bit. The input samples LLR(X_i) and LLR(Y_i) are associated with the ith couple of transmitted symbols X_i and Y_i. The forward and backward metrics α_i and β_{i+1} are supplied, for each branch of the trellis, by the adjacent trellis sections. The outputs of the section are the metrics α_{i+1} and β_i, which are used as inputs to the adjacent sections, as well as the hard decision d̂_i. Moreover, in order to implement a turbo decoder, an additional module, Extr, is required to compute the extrinsic information LLR_ext(X_i). These values are then used as inputs for the Γ module of another APP decoder. As examples, the Γ and A modules of the four-state decoder are illustrated in fig. 11.13. The branch metrics γ are directly obtained from the outputs of the Gilbert cell fed with the LLRs. Let α_i(s) be the forward metric associated with state s of the ith section, and let γ_i(s′, s) be the branch metric between any state s′ of the ith section and a linked state s of the (i + 1)st section. Then the four forward metrics of each section i between 0 and k − 1 are recursively computed as

α_{i+1}(s) = Σ_{s′=0}^{3} α_i(s′) γ_i(s′, s). (11.19)
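The forward recursion can be sketched in a few lines; the branch metrics and trellis connectivity below are illustrative placeholder values, not those of the RSC code of fig. 11.11:

```python
def forward_step(alpha, gamma):
    """One application of eq. 11.19 for a 4-state trellis:
    alpha_{i+1}(s) = sum over s' of alpha_i(s') * gamma_i(s', s).
    gamma[s_prev][s] is the branch metric (0 where no branch exists)."""
    return [sum(alpha[sp] * gamma[sp][s] for sp in range(4))
            for s in range(4)]

def normalize(alpha):
    """Analog decoders keep metrics bounded by construction; in a
    numerical sketch we simply rescale to unit sum."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Illustrative branch metrics for one trellis section: each state has
# two outgoing branches, one per input bit (placeholder values).
gamma = [[0.4, 0.0, 0.6, 0.0],
         [0.3, 0.0, 0.7, 0.0],
         [0.0, 0.5, 0.0, 0.5],
         [0.0, 0.8, 0.0, 0.2]]

alpha = [1.0, 0.0, 0.0, 0.0]      # decoder started in state 0
alpha = normalize(forward_step(alpha, gamma))
```

The backward metrics β are computed by the same kind of recursion run in the opposite direction along the trellis.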

Therefore, the Gilbert cell has to be extended to perform the four multiplications and additions, as shown in fig. 11.13. Note that the structures presented require relatively few transistors. This leads to low silicon area and low power consumption, and opens the way to fully parallel processing. This latter property is essential for the design of high-speed decoders.


Figure 11.12 Analog circular APP decoder. There are as many sections as information bits to decode. Each section is made up of modules related to the APP algorithm operations.

11.7.2 The Next Step: The Analog Turbo Decoder

As a stand-alone decoder of a simple convolutional code, the APP algorithm does not offer outstanding performance. However, once used in a turbo architecture (Berrou et al., 1993), the APP algorithm reaches its full potential. As shown in fig. 11.14, the two APP decoders exchange information (the so-called extrinsic information), through interleavers, on the reliability of the received data. In a digital version of the turbo decoder, these exchanges are clocked, and the decoding process is repeated as many times as necessary to reach a solution. The complexity and the latency of the decoding process are proportional to the number of iterations, which can be a drawback for some applications.

As can easily be seen, the turbo architecture is well suited to analog implementation, since it is a simple feedback system that does not require any internal clocking. In the analog version, the exchanges of extrinsic information are continuous in time. This property, combined with the high degree of parallelism mentioned previously, leads to potential throughputs of several Gbit/s, together with low complexity and latency. As an example, the architecture of a complete DVB-RCS turbo decoder (DVB) was simulated using behavioral models. The decoding of the smallest frame of this standard (48 double-binary symbols and rate 1/2) confirms the high capacity of the analog turbo decoder (Arzel et al., 2004). Besides the gains in throughput, complexity, and latency, fig. 11.15 illustrates that analog decoding also yields a gain in performance (about 0.1 dB at a BER of 10^−4) compared to the digital version. The equivalent digital circuit uses floating-point number representation and runs for 15 iterations, which provides maximum iterative performance. The analog decoder performs better because it benefits from continuous time: there is no iteration but a continuous sharing of extrinsic information.

Figure 11.13 Transistor-level design of Γ and A modules.

Figure 11.14 The analog turbo decoder.

Figure 11.15 Bit and frame error rate curves for the DVB-RCS double-binary turbo code. Frames with k = 96 information bits and R = 1/2. Analog and digital decoding simulation results are compared. (The plot shows BER and FER versus Eb/N0 in dB for the analog turbo decoder, the digital turbo decoder with 15 iterations, and uncoded QPSK.)

In conclusion, the ability of the analog decoder to use soft time, in addition to soft input samples, enhances the error correction of turbo codes while increasing data rates and reducing complexity and latency.

11.8 Other Applications of the Turbo Principle

For the sake of clarity, we assume a point-to-point transmission with only one transmitter and one receiver. However, the discussion can be extended to more complex situations, such as wireless broadcast with point-to-multipoint transmission. A communication system involves different tasks. The aim of each task can be different, and sometimes contradictory, but the system must globally address the following problem: transmit digital information over the propagation channel with maximum rate and minimum error probability.


Actually, the heart of this problem is the propagation channel, because its physical nature always leads to a limited available frequency bandwidth and a nonideal frequency response. Consequently, the channel corrupts the transmitted signal by introducing distortion, noise, and other interference.

The general block diagram shown in fig. 11.16 depicts the basic elements of a digital communications system. The source encoder converts the original message into a sequence with as few binary bits as possible, in order to save bandwidth and optimize the transmission rate. Unlike the source encoder, the channel encoder adds redundancy to the compressed message to form codewords. As detailed in the previous sections, the purpose of this function is to increase the reliability of the transmitted data, which will be distorted and impaired by the channel. Finally, the digital modulator converts the codewords into waveforms that are compatible with the channel. At the receiver, the demodulator converts the waveforms into an analog-like sequence. This sequence is then fed to the decoder, which attempts to reconstruct the compressed message from the constraints of the code. Finally, the source decoder reconstructs the original message from the knowledge of the source encoding algorithm.

Figure 11.16 Basic representation of a digital communications system.

The above description is in fact a simplified version of a digital communication system, and several other functions have been omitted. The functions in a receiver can be classified into three categories: estimation, detection, and decoding (fig. 11.17). By estimation, we mean the estimation of certain channel parameters. Carrier phase recovery, frequency offset estimation, timing recovery, and channel impulse response estimation belong to this category. Detection involves all processing of interference introduced by the channel, such as intersymbol interference, multiple-access interference, cochannel interference, and multi-user interference. Whatever the type of interference, the same mathematical models can be exploited, and algorithms based on identical structures and criteria can be derived. The data are finally recovered by channel decoding.

Figure 11.17 A more detailed receiver.

In this conventional representation, each processing step is performed separately. Detection and estimation modules ignore channel and source coding and work as if the received symbols were mutually independent. Information theory tells us that the maximum likelihood (ML) criterion should be applied globally over the whole receiver in order to minimize the error probability, which is the final target. Unfortunately, with so many factors to take into account, the complexity becomes prohibitive and the problem totally intractable. Consequently, the optimal ML approach is never implemented in practice, and system engineers prefer the conventional receiver of fig. 11.17 despite its suboptimality. Nevertheless, the loss of performance due to this suboptimality is kept as low as possible thanks to the use of soft information. As already mentioned at the beginning of this chapter, the process of making hard decisions discards a part (namely the magnitude, an image of the reliability) of the inherent information present in the received data. For instance, instead of making hard decisions, it is better for the detector to deliver soft information to the decoder.

11.8.1 Feedback Process: The Answer to Suboptimality

Using soft information in the conventional receiver is not sufficient to reach the same performance as that of the optimal ML receiver. The loss of performance essentially comes from the fact that the tasks at the front of the receiver, typically detection and estimation, do not benefit from the work of the channel decoder. In a turbo decoder, this problem has been solved by introducing a feedback loop between the component decoders. This loop enables a bidirectional exchange of soft information between the decoders, so that each benefits from the work done by the other.

The turbo principle can also be applied to solve the suboptimality issue of the conventional receiver. This approach was originally proposed by Douillard et al. (1995) to process equalization and decoding jointly. In a turbo receiver, an iterative process replaces the sequential process of the conventional receiver of fig. 11.17. The received data are now processed several times by the detector, the estimator, and the decoder. Each pass through the estimation/detection/decoding scheme is called an iteration, as in a turbo decoder. The iterative receiver is illustrated in fig. 11.18. At the first iteration, the detection and estimation functions do not benefit from any a priori information about the transmitted data, so the soft information provided to the decoder is the same as in the conventional sequential receiver. In turn, the decoder provides its own soft information for estimation/detection, creating a feedback loop. From the second iteration onward, this feedback information acts as a priori information for estimation/detection and improves their results. Figure 11.18 also indicates a turbo process between the channel decoder and the source decoder. In this situation, constraints inherent to the source encoder may help the channel decoder.

Figure 11.18 Illustration of a simplified turbo receiver. The permutation functions are not represented.
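The iteration loop just described can be sketched schematically; `turbo_receiver` and the `toy_detect`/`toy_decode` stages below are hypothetical stand-ins for real soft-in/soft-out detection and decoding, chosen only to make the feedback structure concrete:

```python
def turbo_receiver(channel_llrs, detect, decode, n_iters=5):
    """Schematic of the iterative receiver of fig. 11.18: the detector
    output feeds the decoder, and the decoder's extrinsic information
    is fed back as a priori knowledge for the next pass."""
    a_priori = [0.0] * len(channel_llrs)   # no prior at iteration 1
    for _ in range(n_iters):
        detected = detect(channel_llrs, a_priori)   # soft detection
        extrinsic = decode(detected)                # decoder extrinsic
        a_priori = extrinsic                        # feedback loop
    return a_priori

# Toy stages: detection combines the channel LLR with the prior, and
# "decoding" returns a damped extrinsic value (purely illustrative).
def toy_detect(llrs, prior):
    return [l + p for l, p in zip(llrs, prior)]

def toy_decode(llrs):
    return [0.5 * l for l in llrs]

out = turbo_receiver([1.2, -0.4, 2.0], toy_detect, toy_decode)
```

With these toy stages the reliabilities converge geometrically toward a fixed point while preserving the sign of each channel LLR, mimicking how successive iterations refine, rather than overturn, the soft decisions.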

Thus a turbo receiver involves repeating the information exchanges a number of times, the channel decoder playing a central role thanks to the amount of redundancy it benefits from. If the SNR is not too low, the turbo receiver produces new and better estimates at each iteration. Nevertheless, this performance improvement is bounded by the ML performance, which cannot be surpassed. In section 11.5, the role and importance of the permutation function were emphasized. In the turbo receiver, permutations (not represented in fig. 11.18) also play a crucial role by minimizing the correlation effects in consecutive pieces of information exchanged by the different processors. In particular, an appropriate permutation located after the decoder avoids the consequences of possible residual error bursts. The implementation of the turbo receiver assumes that all functions are able to accept and deliver soft extrinsic information. In particular, the APP algorithm is well suited to the decoder, especially if it is a turbo decoder.
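The burst-dispersing effect of such a permutation can be illustrated with a simple block interleaver; this is a generic sketch, not the actual permutation of any code discussed in this chapter:

```python
def block_interleave(symbols, rows, cols):
    """Block interleaver: write row-wise, read column-wise.
    Symbols that were adjacent end up `rows` positions apart, so a
    short burst of errors is dispersed after deinterleaving."""
    assert len(symbols) == rows * cols
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]

def block_deinterleave(symbols, rows, cols):
    """Inverse permutation: write column-wise, read row-wise."""
    assert len(symbols) == rows * cols
    return [symbols[c * rows + r] for r in range(rows) for c in range(cols)]

data = list(range(12))
tx = block_interleave(data, rows=3, cols=4)
# A burst of consecutive errors in tx lands on nonadjacent positions
# of the deinterleaved stream, which the decoder handles far better.
```

The interleave/deinterleave pair is an exact inverse, as required for the extrinsic information to be realigned between the two processors.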

11.8.2 Probabilistic Detection and Estimation Algorithms

Before the invention of turbo coding and decoding, probabilistic algorithms with a priori probabilities were not needed for conventional detection or estimation tasks, and so they were not known. Consequently, before implementing an iterative process between estimation/detection and decoding, probabilistic algorithms for estimation and detection had to be elaborated. In fact, the APP algorithm suits the problem of detection well, whereas the expectation-maximization (EM) algorithm (Moon, 1996), also a probabilistic algorithm, is better adapted to the problem of estimation. Many iterative digital receivers are based on these two algorithms, whose major advantage is optimality. Furthermore, iterative exchanges between detection, or estimation, and decoding can be implemented directly thanks to a priori probabilities. However, these algorithms are computationally demanding, and the resulting complexity of the receiver may become prohibitive. In this situation, suboptimal alternatives are needed.

11.8.3 Suboptimal Algorithms

The principles of these alternatives were already known for conventional detection and estimation. They are suboptimal because they are based on criteria other than minimizing the error probability, such as minimizing the mean square error (minimum mean square error, MMSE) or maximizing the SNR. But these suboptimal algorithms also lead to less complex implementations, often in the form of filter-based structures. As these structures were traditionally used in the conventional receiver of fig. 11.17, no a priori input was required. In order to exchange information with the channel decoder, detection and estimation algorithms have to take a priori probabilities into account and also deliver probabilities.

Two solutions have been considered in the literature to implement soft-in/soft-out detectors and estimators. The first involves modifying or adapting preexisting techniques. The second is to derive totally new architectures.

Let us consider the first approach. The problem can be formulated as follows: how can the a priori input sequence be utilized in a conventional estimation/detection algorithm, when so far such algorithms only involved processing the received data? In fact, several estimation or detection algorithms already use feedback of estimated data, for example, the DFE (decision feedback equalizer) for combating intersymbol interference (Proakis, 2000) and the SIC (successive interference cancellation) receiver for multi-user transmission (see Dai and Poor, 2002, for instance). An illustration of these locked-up schemes is given in fig. 11.19. The loop generally involves hard decisions on symbols, provided by a hard slicer, that are utilized to improve the estimation/detection result. The derivation of the algorithms is generally based on the assumption that the hard decisions fed back to the detection/estimation algorithm correspond exactly to the transmitted symbols. In a turbo process, the decoder can replace the hard slicer and provide improved feedback information. The detection or estimation algorithms are not modified and are still based on the assumption of perfect feedback. As the channel decoder provides probabilities and the detection/estimation algorithms expect symbol estimates, a mapper is needed between the channel decoder and the estimation/detection algorithms. This mapper generates an estimate of the transmitted symbols from the a posteriori or extrinsic probabilities provided by the decoder. This estimate is then fed back directly into the suboptimal detection and estimation algorithms, as explained previously.

Figure 11.19 Illustration of a conventional locked-up estimation/detection scheme.

However, more potential gain is available if the assumption of perfect feedback is dropped in the derivation of the algorithms. Indeed, the assumption of perfect feedback in a turbo process is not valid, since the data estimates improve iteration after iteration; perfect feedback is achieved only once the turbo process has converged. Consequently, retaining the assumption of perfect feedback in the derivation of the detection/estimation algorithms, as in the first approach, generally leads to a nonoptimal solution, and a new derivation of the algorithms becomes necessary. The derivation is still based on the initial suboptimal criteria, MMSE for example, but feedback decoder probabilities, replacing the perfect feedback assumption, are introduced into the derivation. Very powerful algorithms are thus derived, generally different from the conventional algorithms and often based on an interference canceller structure. Furthermore, they offer very significant complexity advantages over probabilistic algorithms. Iterative schemes based on well-designed suboptimal algorithms achieve very good performance and can potentially reach exactly the same performance bound as iterative schemes based on probabilistic algorithms. The major difference is the convergence speed: more iterations are sometimes necessary to achieve the performance bound.

In conclusion, the turbo principle has found numerous applications in communication receivers. It has proved its capacity to achieve performance very close to that of the ML receiver, while requiring significantly reduced complexity.
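For a BPSK (±1) alphabet, the soft symbol estimate produced by such a mapper follows directly from the decoder LLR; a minimal sketch (the tanh form is the standard expectation for a ±1 alphabet):

```python
import math

def soft_symbol(llr):
    """Soft estimate of a BPSK symbol x in {+1, -1} from its LLR:
    E[x] = (+1) Pr(x = +1) + (-1) Pr(x = -1) = tanh(LLR / 2)."""
    return math.tanh(llr / 2.0)

# An interference canceller subtracts these soft estimates instead of
# hard decisions: weak decisions (LLR near 0) contribute little, while
# strong ones approach the hard values +1 / -1.
estimates = [soft_symbol(l) for l in (-4.0, -0.1, 0.0, 2.5)]
```

This soft mapping is what lets the detection stage weight the decoder feedback by its reliability rather than trusting every decision equally.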

11.8.4 Further References

Because the literature in the domain of turbo receivers is huge, we restrict the references to pioneering or overview papers. As already mentioned, the turbo principle was extended for the first time to the problem of joint equalization and decoding in 1995 (Douillard et al., 1995). This new turbo receiver, called a turbo detector, was based on an APP equalizer and an APP channel decoder, and demonstrated quasi-optimal performance. The second successful attempt concerned coded modulation. Robertson and Wörz (1996) introduced an efficient coding scheme for high-order modulations based on the parallel concatenation of Ungerboeck codes (Ungerboeck, 1982). This technique, named turbo trellis coded modulation (TTCM), provides good performance on AWGN channels.


Later, Glavieux et al. (1997) proposed a low-complexity version of a turbo equalizer based on adaptive MMSE filters. For the first time, a nonprobabilistic algorithm was introduced into a turbo receiver: a priori information is taken into account in the filter coefficient computation thanks to an adaptive algorithm such as the least mean square (LMS) algorithm. At the same time, Hagenauer introduced the "turbo principle," extending the concept to several tasks in a communication system: joint source and channel decoding, coded modulation, and multi-user detection (Hagenauer, 1997b). Following these pioneering works, a huge number of papers was devoted to the turbo principle. For instance, a bit-interleaved coded modulation (BICM) turbo decoder for ergodic channels was proposed independently by ten Brink et al. (1998) and Chindapol et al. (1999). This technique, called iterative demapping or BICM-ID (iterative decoder), is based on a feedback loop between the demapper, which computes LLRs, and the channel decoder. The a priori information provided by the decoder is used in the soft demapper to remove the assumption of independent and identically distributed coded bits. This technique needs non-Gray mapping in order to provide performance gains. In the domain of synchronization, Langlais and Hélard (2000) addressed the problem of carrier phase recovery for turbo decoding at low SNRs by using tentative decisions from the first component decoder in the carrier phase recovery loop. As this system does not exploit the iterative structure of the turbo decoder, Lottici and Luise (2002) then developed a carrier phase recovery embedded in turbo decoding. More recently, Barry et al. (2004) have proposed an overview of methods for implementing timing recovery with turbo decoding. In the past few years, many significant developments have also arisen in the field of iterative multi-user detection; a number of references can be found in the paper by Poor (2004).
Finally, the turbo principle can also be applied to multiple-antenna detection, as in MIMO (multi-input multi-output) systems. The famous BLAST system from Bell Labs has thus been transposed into a turbo receiver, in which the multi-antenna detector exchanges information with the channel decoder (Haykin et al., 2004). Optimal and suboptimal multiple-antenna detectors are also presented there, leading to various implementation complexities.

12 Blind Signal Processing Based on Data Geometric Properties

Konstantinos Diamantaras

12.1 Introduction

Blind signal processing deals with the outputs of unknown systems excited by unknown inputs. At first sight the problem seems intractable, but a closer look reveals that certain signal properties allow us to extract the inputs or to identify the system up to some, usually unimportant, ambiguities. Linear systems are mathematically the most tractable and, naturally, have attracted most of the attention. Depending on the type of linear system, blind problems arise in a wide variety of applications, for example, in digital communications (Diamantaras and Papadimitriou, 2004a,b; Diamantaras et al., 2000; Godard; Papadias and Paulraj, 1997; Paulraj and Papadias, 1997; Shalvi and Weinstein, 1990; Talwar et al., 1994; Tong et al., 1994; Torlak and Xu, 1997; Treichler and Agee, 1983; Tsatsanis and Giannakis, 1997; van der Veen and Paulraj, 1996; van der Veen et al., 1995; Yellin and Weinstein, 1996), in biomedical signal processing (Choi et al., 2000; Cichocki et al., 1999; Jung et al., 1998; Makeig et al., 1995, 1997; McKeown et al., 1998; Vigário et al., 2000), and in acoustics and speech processing (Douglas and Sun, 2003; Parra and Spence, 2000; Parra and Alvino, 2002; Shamsunder and Giannakis, 1997). Many recent books on the subject (Cichocki and Amari, 2002; Haykin, 2001a,b; Hyvärinen et al., 2001) provide extensive discussion of related problems and methods.

The most general finite, linear, time-invariant (LTI) system is expressed by a multichannel convolution of length L, operating on a discrete vector signal s(k) = [s1(k), · · · , sn(k)]^T:

x(k) = Σ_{i=0}^{L−1} H_i s(k − i). (12.1)

The FIR ﬁlter taps Hi are complex matrices, in general, of size m × n, m ≥ 1. Thus the output is an m-dimensional complex vector x(k). For n, m > 1, equation 12.1 describes a linear, discrete, multi-input multi-output (MIMO) system.


12.1.1 Types of Mixing Systems

We shall study two special cases of system 12.1 that share many similarities but also have some special characteristics, as described below.

Instantaneous Mixtures: In this case we have more than one source and more than one observation, i.e., m, n > 1, but there is no convolution involved, so L = 1. The output vector is produced by a linear, instantaneous transformation:

x(k) = H s(k). (12.2)

This type of system is also called memoryless.

Single-Input Single-Output (SISO) Convolution: In this case we have exactly one source and one observation, so m = n = 1, but the convolution is nontrivial, i.e., L > 1:

x(k) = Σ_{i=0}^{L−1} h_i s(k − i). (12.3)

Equation 12.3 describes a linear, SISO FIR filter.
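The two system types can be sketched numerically; the sizes, taps, and BPSK alphabet below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Instantaneous (memoryless) mixture, eq. 12.2: x(k) = H s(k).
n, m, K = 2, 3, 100                        # sources, observations, samples
H = rng.standard_normal((m, n))            # unknown mixing matrix
S = rng.choice([-1.0, 1.0], size=(n, K))   # e.g. BPSK sources
X_inst = H @ S                             # column k is the observation x(k)

# SISO convolution, eq. 12.3: x(k) = sum_i h_i s(k - i).
h = np.array([1.0, 0.5, -0.2])             # FIR taps (illustrative), L = 3
s = rng.choice([-1.0, 1.0], size=K)
x_conv = np.convolve(s, h)[:K]             # causal FIR output, first K samples
```

In the blind setting only X_inst or x_conv is observed; H, h, and the sources are all unknown.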

12.1.2 Types of Blind Problems

Regardless of the specific system type, two kinds of blind problems are of interest here, depending on whether we wish to extract the input signals or the system parameters.

Blind Source Extraction: In this type of problem, our goal is to recover the source(s) given the observation signal x(k) or x(k). If there is more than one source, the problem is called blind source separation (BSS). In the case of BSS, the linear system may be either instantaneous or convolutive (general MIMO). In the case of blind deconvolution (BD), we want to invert a linear filter, which, of course, operates on its input via the convolution operator, hence the name deconvolution. The problem is very important, for example, in wireless communications, where n transmitted signals corrupted by intersymbol interference (ISI), multi-user interference (MUI), and noise are received at m antennas. The source separation/extraction problem has an inherent ambiguity in the order and the scale of the sources: the original signals cannot be retrieved in their original order or scale unless some further information is available. For example, if the source samples (symbols) are drawn from a known finite alphabet, then there is no scale ambiguity. If, however, the alphabet is symmetric with respect to zero, a sign ambiguity exists, since both s(k) and −s(k) are plausible. Furthermore, the ordering ambiguity is always present if the problem involves more than one source.


Blind System Identification: In this type of problem, our goal is to obtain the system parameters rather than to recover the source signals. If the system is memoryless, the goal is to recover the mixing matrix H. If the system involves nontrivial convolution, the goal is to extract the filter taps h0, . . . , hL−1, or H0, . . . , HL−1.

12.1.3 Approaches to Blind Signal Processing

Typically, blind problems are approached either by using statistical properties of the signals involved or by exploiting the geometric structure of the data constellation, as described next.

Higher-Order Methods: According to the central limit theorem, the system output, which is a sum of many input samples, will approach the Gaussian distribution, irrespective of the input distribution. A characteristic property of the Gaussian distribution is that all higher-order cumulants (for instance, the kurtosis) are zero. If the inputs are not normally distributed, their higher-order cumulants will be nonzero, for example positive, and so equation 12.1 acts as a "cumulant reducer." Clearly, the blind system inversion, the linear transform that recovers the sources from the output, should act as a "cumulant increaser," i.e., it should maximize the absolute cumulant value for a given signal power. In fact, this is the basic idea behind all higher-order methods.

Second-Order Methods: Alternatively, second-order methods can be applied when the sources have colored spectra, regardless of their distribution. If the source colors are not identical, the time-delayed covariance matrices have a certain eigenvalue structure that reveals the mixing operator in the memoryless case. This information can be used for recovering the sources as well. In the dynamic case, things are more complicated, although, again, second-order methods have been proposed based on statistics in either the frequency or the time domain.

A Third Approach, Exploiting the Signal Geometry: Neither higher-order nor second-order methods exploit the cluster structure or shape of the input data when such a structure or shape exists. Consider, for example, a source signal s(k) whose samples are drawn from a finite alphabet A_M = {±1, · · · , ±(M/2)} (M even). Let the SISO FIR filter described in equation 12.3 be excited by s(k).
Writing N equations (N ≥ L) of the form of equation 12.3 for N consecutive values of k, we obtain the following matrix equation:

⎡ x(k)       ⎤   ⎡ s(k)       s(k−1)     · · ·  s(k+1−L) ⎤ ⎡ h0    ⎤
⎢ x(k+1)     ⎥   ⎢ s(k+1)     s(k)       · · ·  s(k+2−L) ⎥ ⎢ h1    ⎥
⎢    ⋮       ⎥ = ⎢    ⋮          ⋮                ⋮      ⎥ ⎢  ⋮    ⎥      (12.4)
⎣ x(k+N−1)   ⎦   ⎣ s(k+N−1)   s(k+N−2)   · · ·  s(k+N−L) ⎦ ⎣ hL−1  ⎦

or, in compact form,

x = S h, (12.5)


where the N × L Toeplitz matrix S involves (N + L − 1) unknown input symbols. It is possible, in principle, to identify h in a deterministic way by an exhaustive search over all M^(N+L−1) possible matrices S such that min_h ‖x − S h‖² = 0. Although highly impractical, this observation tells us that there is more to blind signal processing than statistical processing. If the sources, for example, have a certain structure that produces clusters in the data cloud, or if the input distribution is bounded (e.g., uniform), then one can exploit the geometric properties of the output constellation and derive fast and efficient deterministic algorithms for blind signal processing. These methods are treated in this chapter.

In particular, section 12.2 discusses blind methods for systems with finite alphabet sources. The discussion covers both instantaneous and convolutive mixtures and is based on the geometric properties of the data cloud. Section 12.3 discusses the case of continuous-valued sources that are either sparse or have a specific input distribution, for example, uniform. Our discussion of continuous sources covers only the case of instantaneous systems. Certainly there is a lot of room for innovation along this line of research, since many issues remain open today.
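The exhaustive identification described above can be sketched on a deliberately tiny example (binary alphabet, L = 2, N = 4, hence only 2^5 = 32 candidate symbol sequences); `blind_identify` and the symbol ordering inside it are illustrative choices, not an algorithm from the literature:

```python
import itertools
import numpy as np

def blind_identify(x, L, alphabet=(-1.0, 1.0)):
    """Exhaustive deterministic identification: enumerate all
    M^(N+L-1) candidate symbol sequences, build the N x L Toeplitz
    matrix S of eq. 12.4, and keep the h minimizing ||x - S h||."""
    N = len(x)
    best_err, best_h = np.inf, None
    for syms in itertools.product(alphabet, repeat=N + L - 1):
        # syms holds s(k+1-L), ..., s(k+N-1); row j of S is
        # [s(k+j), s(k+j-1), ..., s(k+j+1-L)].
        S = np.array([[syms[L - 1 + j - i] for i in range(L)]
                      for j in range(N)])
        h, _, _, _ = np.linalg.lstsq(S, x, rcond=None)
        err = np.linalg.norm(x - S @ h)
        if err < best_err:
            best_err, best_h = err, h
    return best_h

# Tiny example: true filter and binary input (illustrative values).
h_true = np.array([1.0, -0.5])
s = np.array([1.0, -1.0, 1.0, 1.0, -1.0])   # s(k-1), ..., s(k+3)
x = np.array([h_true[0] * s[j + 1] + h_true[1] * s[j] for j in range(4)])
h_est = blind_identify(x, L=2)
# h_est reproduces the true tap magnitudes, up to the inherent
# ambiguities of the blind problem.
```

The cost grows as M^(N+L-1), which is exactly why the chapter turns to geometric structure instead of brute force.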

12.2  Finite Alphabet Sources

Blind problems involving sources with finite alphabets (FA) have drawn a lot of attention because such signals are common in digital communications. Popular modulation schemes, for instance quadrature amplitude modulation (QAM), pulse amplitude modulation (PAM), and binary phase-shift keying (BPSK), produce signals with limited numbers of symbols. A large body of literature exists on the instantaneous mixture problem, not only because it is the simplest one but also because most methods dealing with the more realistic convolutive mixture problem lead to the solution of an instantaneous problem. In Anand et al. (1995), the blind separation of binary sources from instantaneous mixtures is approached using separate clustering and bit-assignment algorithms. An extension of this method is presented in Kannan and Reddy (1997), where a maximum likelihood (ML) estimate of the cluster centers is provided. Talwar et al. (1996) presented two iterative least-squares methods for the BSS of binary sources: ILSP (iterative least squares with projection) and ILSE (iterative least squares with enumeration). The same problem is treated in van der Veen (1997), where the real analytical constant modulus algorithm (RACMA) is introduced, based on the singular value decomposition (SVD) of the observation matrix. In Pajunen (1997), an iterative algorithm is proposed for the blind separation of more binary sources than sensors. Finite alphabet sources and instantaneous mixtures are discussed in Belouchrani and Cardoso (1994), where an ML approach is proposed using the EM algorithm. Grellier and Comon (1998) introduce a polynomial criterion and a related minimization algorithm to separate FA sources. In all the above methods the geometric properties of the data cloud are not explicitly used. Geometrical concepts, such as the relative distances between the cluster centers, were introduced in Diamantaras (2000) and Diamantaras


and Chassioti (2000). It turns out that just one observation signal is sufficient for blindly separating n binary sources, in the noise-free case, under mild assumptions. A similar algorithm based on geometric concepts was later proposed in Li et al. (2003). In this section we shall study the geometric structure of data constellations generated by linear systems operating on signals with finite alphabets. We'll find that the geometry of the obtained data cloud contains information pertaining to the generating linear operator. This information can be exploited either for the blind extraction of the system parameters or for the blind retrieval of the original sources.

12.2.1  Instantaneous Mixtures of Binary Sources

The simplest alphabet is the two-element set, or binary alphabet Aa = {−1, 1}. We shall assume that the samples of some source signals are drawn from Aa; such signals will be called binary antipodal or, simply, binary. In digital communications the carrier modulation scheme using symbols from Aa is called binary phase-shift keying (BPSK). The reader is encouraged to verify that our results can be easily generalized to any type of binary alphabet, for example, the nonsymmetric set Ab = {0, 1}. In this subsection we shall concentrate on problem type 1, i.e., on linear memoryless mixtures of many sources, n > 1. Depending on the number of output signals (observations) m, we treat three distinct cases: m = 1, m = 2, and m > 2.

12.2.1.1  A Single Mixture

The instantaneous mixture of n sources linearly combined into a single observation is described by the following equation:

x(k) = Σ_{i=1}^{n} h_i s_i(k) = h^T s(k),                        (12.6)

h = [h1 · · · hn]^T,   s(k) = [s1(k) · · · sn(k)]^T.

We assume that the mixing coeﬃcients hi are real and that si (k) ∈ Aa . If the coeﬃcients are complex, then the problem corresponds to the case m = 2, which is treated later. We start by studying the noise-free system since our primary interest is to investigate the structural properties of the signals and not to develop methods to combat the noise. Of course, eventually, the development of a viable algorithm will have to deal with the noise issue.


Equation 12.6 can be seen as the projection x̃(k) of s(k) along the direction of the normal vector h̃, scaled by ‖h‖:

x(k) = ‖h‖ x̃(k),                                                (12.7)
x̃(k) = h̃^T s(k),                                               (12.8)
h̃ = h / ‖h‖.                                                    (12.9)

The set of values of x(k) will be called the constellation of x(k), denoted by X. It is a set of (at most) 2^n points in 1-D space, R. In order to facilitate our understanding of the geometric structure of X, let us start by assuming that there are only n = 2 sources. There are then four possible realizations of the vector s(k), which form the source constellation S = {s−−, s−+, s+−, s++}, where s−− = [−1, −1]^T, s−+ = [−1, 1]^T, s+− = [1, −1]^T, and s++ = [1, 1]^T. Consequently, the output constellation X also consists of four distinct values:

x−− = ‖h‖ x̃−− = h^T s−−,
x−+ = ‖h‖ x̃−+ = h^T s−+,
x+− = ‖h‖ x̃+− = h^T s+−,
x++ = ‖h‖ x̃++ = h^T s++.

Figure 12.1 shows the projections x̃−−, x̃−+, x̃+−, x̃++ of the source constellation S for four different normal mixing vectors h̃. It is obvious that the relative distance between the points on the projection line is a function of the angle θ between the projection line and the horizontal axis. The problem clearly involves a lot of symmetry. In particular, it is straightforward to verify that we obtain the same output constellation X for the angles ±θ, ±(π/2 − θ), ±(π − θ), and ±(3π/2 − θ), for any θ. This multiple symmetry is the result of the interchangeability of the two sources, s1 and s2, as well as the invariance of the source constellation to sign changes. These ambiguities are, however, acceptable, since it is not possible to recover the original source order or the original source signs. Both the source order and the sign are unobservable, as is evident from the following relations:

x(k) = [±h1, ±h2] [±s1(k), ±s2(k)]^T
     = [±h2, ±h1] [±s2(k), ±s1(k)]^T.

Therefore, let us assume, without loss of generality, that the mixing vector h satisfies the following constraint:

h1 > h2 > 0.                                                     (12.10)

Under this assumption, the elements of X are ordered:

x−− = −h1 − h2 < x−+ = −h1 + h2 < x+− = +h1 − h2 < x++ = +h1 + h2.   (12.11)

Figure 12.1  The source constellation (circles) of two independent binary sources is projected on four different directions. The relative distances of the projection points (marked by squares) are clearly a function of the slope of the projection line.

Indeed, the first and third inequalities in 12.11 are obvious since the mixing coefficients are positive. The second inequality is also true since x+− − x−+ = 2(h1 − h2) > 0. Thus, by clustering the (observable) output sequence {x(1), x(2), x(3), · · ·} we obtain four cluster points c1, c2, c3, c4, which can be arranged in increasing order and set into one-to-one correspondence with the elements of X:

c1 = x−− < c2 = x−+ < c3 = x+− < c4 = x++,   c1 = −c4,   c2 = −c3.   (12.12)

Then, using equation 12.11, we can recover the mixing parameters:

h1 = (c3 − c1)/2,                                                (12.13)
h2 = (c2 − c1)/2.                                                (12.14)


Example 12.1  Figure 12.2 shows the position of the cluster points c1, . . . , c4 for the random mixing vector h = [0.9659, 0.2588]^T. According to equations 12.11 and 12.12, these cluster points are

c1 = x−− = −1.2247,
c2 = x−+ = −0.7071,
c3 = x+− = 0.7071,
c4 = x++ = 1.2247.

By computing the distances between the pairs (c3, c1) and (c2, c1), we obtain directly the unknown mixing parameters:

(c3 − c1)/2 = 0.9659 = h1,
(c2 − c1)/2 = 0.2588 = h2.

Figure 12.2  The distances c3 − c1 and c2 − c1 between the cluster points are equal to twice the size of the unknown mixing parameters.

If our aim is to identify the mixing parameters h1, h2, then equations 12.13 and 12.14 have achieved our goal. If, in addition, we want to extract the hidden sources, then we may estimate each input sample s(k) separately by finding the binary vector b = [b1, b2]^T ∈ A_a^2 such that h^T b best approximates x(k). This corresponds to the following binary optimization problem:

ŝ(k) = arg min_{b ∈ A_a^2} |x(k) − h^T b|,   for all k.          (12.15)

Luckily the above optimization problems are decoupled, for diﬀerent k, and therefore the solution is trivial.
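The whole two-source procedure, identification via equations 12.13 and 12.14 followed by the decoupled decoding of equation 12.15, can be sketched in a few lines of Python. The mixing vector is the one of example 12.1; the function names and the particular source samples are our own illustrative choices.

```python
def identify_two_binary(x_samples):
    """Recover h1 > h2 > 0 from a noise-free mixture x(k) = h1 s1(k) + h2 s2(k)
    of two binary (+/-1) sources via eqs. 12.13-12.14.  In the noise-free case
    'clustering' reduces to collecting the distinct output values."""
    centers = sorted(set(round(v, 10) for v in x_samples))
    assert len(centers) == 4, "all four constellation points must be observed"
    c1, c2, c3, _ = centers
    return (c3 - c1) / 2, (c2 - c1) / 2      # eq. 12.13, eq. 12.14

def decode_sample(xk, h1, h2):
    """Solve the decoupled binary problem of eq. 12.15 for one output sample."""
    return min(((b1, b2) for b1 in (-1, 1) for b2 in (-1, 1)),
               key=lambda b: abs(xk - (h1 * b[0] + h2 * b[1])))

# the mixing vector of example 12.1; the source samples are hypothetical
h = (0.9659, 0.2588)
s = [(1, -1), (-1, -1), (1, 1), (-1, 1)]
x = [h[0] * a + h[1] * b for a, b in s]
h1, h2 = identify_two_binary(x)
```

Because the four per-sample problems are decoupled, decoding is just a nearest-point lookup over the four candidates h^T b.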


The whole idea can be extended to more than two sources using recursive system deflation. This process iteratively identifies and removes the two smallest mixing parameters, thus eventually reducing the problem either to the two-input case, which is solved as above, or to the single-input case, which is trivial. Our linear mixture model is again the one described in 12.6, with n > 2 and some real mixing vector h = [h1, · · · , hn]^T. As before, without loss of generality, we shall assume that the mixing parameters are positive and arranged in decreasing order:

h1 > h2 > · · · > hn > 0.                                        (12.16)

We have already shown that for n = 2 the centers ci are arranged in increasing order. For n > 2 things are a bit more complicated. Let us define B(n) to be the 2^n × n matrix whose ith row b_i^(n)T is the binary (±1) representation of the number (i − 1) ∈ {0, · · · , 2^n − 1}:

B(n) = [ −1  −1  · · ·  −1  −1
         −1  −1  · · ·  −1   1
         −1  −1  · · ·   1  −1
          ⋮    ⋮           ⋮    ⋮
          1   1  · · ·  −1   1
          1   1  · · ·   1  −1
          1   1  · · ·   1   1 ].                                (12.17)

Although the sequence {c1, · · · , c_{2^n}} of the centers

c_i = b_i^(n)T h = Σ_{j=1}^{n} b_{ij}^(n) h_j,   i = 1, · · · , 2^n,   (12.18)

is not exactly arranged in increasing (or decreasing) order, there is a lot of structure in the sequence, as summarized by the following facts (Diamantaras and Chassioti, 2000):

The first three centers c1 < c2 < c3 are the three smallest values in the sequence {ci}. Similarly, the last three centers c_{2^n−2} < c_{2^n−1} < c_{2^n} are the three largest values in the sequence {ci}.

The sequence c1, . . . , c_{2^n} defined by 12.18 consists of consecutive quadruples, each arranged in increasing order: c_{4i+1} < c_{4i+2} < c_{4i+3} < c_{4i+4}, i = 0, · · · , 2^{n−2} − 1. The smallest element of the ith quadruple is

c_{4i+1} = [ Σ_{j=1}^{n−2} b_{4i+1,j}^(n) h_j ] − h_{n−1} − h_n.   (12.19)


The differences

δ1 = c_{4i+2} − c_{4i+1} = 2 h_n,                                (12.20)
δ2 = c_{4i+3} − c_{4i+1} = 2 h_{n−1},                            (12.21)
δ3 = c_{4i+4} − c_{4i+1} = 2 (h_{n−1} + h_n),                    (12.22)

between the members of the ith quadruple are independent of i.

Since c2 = c1 + 2h_n and c3 = c1 + 2h_{n−1}, the two smallest mixing parameters h_{n−1}, h_n can be retrieved using the values of the three smallest centers c1, c2, and c3:

h_n = (c2 − c1)/2,                                               (12.23)
h_{n−1} = (c3 − c1)/2.                                           (12.24)

Once we have obtained h_{n−1} and h_n we can define a new sequence {c′_i} by picking the first element of each quadruple, shifted by the sum (h_{n−1} + h_n), thus obtaining

c′_i = c_{4(i−1)+1} + h_{n−1} + h_n = Σ_{j=1}^{n−2} b_{4(i−1)+1,j}^(n) h_j,   i = 1, · · · , 2^{n−2}.   (12.25)

Notice, however, that the first n − 2 bits of the [4(i − 1) + 1]-th row of B(n) are exactly the bits of the ith row of B(n−2). In other words,

b_{4(i−1)+1,j}^(n) = b_{ij}^(n−2),   j = 1, · · · , n − 2,

and therefore

c′_i = b_i^(n−2)T h = Σ_{j=1}^{n−2} b_{ij}^(n−2) h_j,   i = 1, · · · , 2^{n−2}.   (12.26)

Using these facts, the following recursive algorithm is constructed:

Algorithm 12.1: n binary sources, one observation
Step 1: Compute the centers ci and sort them in increasing order.
Step 2: Compute hn, hn−1 from equations 12.23 and 12.24.
Step 3: Compute the differences δi using equations 12.20, 12.21, and 12.22.
Step 4: Remove the set {c1, c2, c3, c1 + δ3} from the sequence {ci}. Set c′1 = c1 + hn + hn−1 as the first element of a new sequence {c′i}.
Step 5: Repeat until all elements have been removed: find the smallest element cj of the remaining sequence {ci}; remove the set {cj, cj + δ1, cj + δ2, cj + δ3} from {ci}; keep cj + hn + hn−1 as the next element of the sequence {c′i}. At the end, the new sequence {c′i} will be four times shorter than the original {ci}.
Step 6: Recursively repeat the algorithm for the new sequence {c′i} and for a new n′ = n − 2 to obtain h_{n′} = h_{n−2}, h_{n′−1} = h_{n−3}. Eventually, n′ = 2 or n′ = 1.

Steps 4 and 5 are the basic recursion, which reduces the problem size from n to n − 2 by replacing the sequence {ci} by {c′i}. At step 6, we iteratively obtain the pairs (hn, hn−1), (hn−2, hn−3), . . . , until we reach the case where n = 2 or n = 1. The case of n = 2 sources was treated in the previous subsection. The case of n = 1 is trivial since it involves only one source: the observation is simply a scaled version of the input, x(k) = h1 s1(k), so the estimation of h1 and s(k) is easy; we have h1 = |x(k)| (since |s(k)| = 1 and h1 > 0), and so s(k) = x(k)/h1.

Example 12.2  Consider the following system with four sources and one observation:

x(k) = −0.4326 s1(k) + 1.2656 s2(k) + 0.1553 s3(k) − 0.2877 s4(k).

The mixing vector h = [−0.4326, 1.2656, 0.1553, −0.2877] does not satisfy equation 12.16. The algorithm will recover the vector ĥ = [1.2656, 0.4326, 0.2877, 0.1553], which does satisfy equation 12.16 and is identical to h except for the permutation and sign changes of its elements.

Step 1: The sorted sequence of centers is
c = { −2.1412, −1.8306, −1.5658, −1.2760, −1.2552, −0.9654, −0.7006, −0.3900, 0.3900, 0.7006, 0.9654, 1.2552, 1.2760, 1.5658, 1.8306, 2.1412 }.

Step 2: Using equations 12.23 and 12.24, we compute ĥ3 = 0.2877, ĥ4 = 0.1553.
Step 3: Using equations 12.20, 12.21, and 12.22, we obtain δ1 = 0.3106, δ2 = 0.5754, δ3 = 0.8860.
Step 4: Remove {c1, c2, c3, c1 + δ3} = {−2.1412, −1.8306, −1.5658, −1.2552} from c. Set c′1 = −1.6982. New sorted sequence:
c = { −1.2760, −0.9654, −0.7006, −0.3900, 0.3900, 0.7006, 0.9654, 1.2552, 1.2760, 1.5658, 1.8306, 2.1412 }.
Step 5: Remove {c1, c1 + δ1, c1 + δ2, c1 + δ3} = {−1.2760, −0.9654, −0.7006, −0.3900} from c. Set c′2 = −0.8330. New sorted sequence:
c = { 0.3900, 0.7006, 0.9654, 1.2552, 1.2760, 1.5658, 1.8306, 2.1412 }.
Step 6: Remove {c1, c1 + δ1, c1 + δ2, c1 + δ3} = {0.3900, 0.7006, 0.9654, 1.2760}. Set c′3 = 0.8330. New sorted sequence:
c = { 1.2552, 1.5658, 1.8306, 2.1412 }.


Step 6: Remove {c1, c1 + δ1, c1 + δ2, c1 + δ3} = {1.2552, 1.5658, 1.8306, 2.1412}. Set c′4 = 1.6982. New sorted sequence: c = ∅.

The new sequence c′ = {−1.6982, −0.8330, 0.8330, 1.6982} yields the estimates ĥ1 = 1.2656, ĥ2 = 0.4326 of the remaining mixing parameters.
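Algorithm 12.1 can be sketched as a short recursive routine; this is our own minimal implementation (function name and nearest-value matching, used to absorb floating-point error when removing quadruple members, are our choices), exercised on the four-source system of example 12.2.

```python
from itertools import product

def algorithm_12_1(centers):
    """Sketch of algorithm 12.1: recover the positive, decreasingly ordered
    mixing parameters of n binary sources from the 2^n cluster centers of a
    single noise-free mixture, by recursive system deflation."""
    c = sorted(centers)
    params = []
    while len(c) > 2:
        hn = (c[1] - c[0]) / 2           # eq. 12.23
        hn1 = (c[2] - c[0]) / 2          # eq. 12.24
        params = [hn1, hn] + params      # hn-1 > hn, prepend in order
        d1, d2, d3 = 2 * hn, 2 * hn1, 2 * (hn1 + hn)   # eqs. 12.20-12.22
        nxt = []
        while c:
            cj = c[0]
            for target in (cj, cj + d1, cj + d2, cj + d3):
                # remove the value closest to the expected quadruple member
                c.remove(min(c, key=lambda v: abs(v - target)))
            nxt.append(cj + hn + hn1)    # shifted first element (eq. 12.25)
        c = sorted(nxt)
    if len(c) == 2:                      # n = 1 remains: centers are {-h1, +h1}
        params = [c[1]] + params
    return params

# the system of example 12.2 (order and signs are unobservable, so we use
# the canonical vector that satisfies eq. 12.16)
h = [1.2656, 0.4326, 0.2877, 0.1553]
centers = [sum(b * w for b, w in zip(bits, h))
           for bits in product((-1, 1), repeat=4)]
estimates = algorithm_12_1(centers)
```

Run on the 16 centers of example 12.2, the routine reproduces the deflation steps of the example: the first pass extracts 0.2877 and 0.1553, the second pass reduces the sequence to {−1.6982, −0.8330, 0.8330, 1.6982} and extracts 1.2656 and 0.4326.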

12.2.1.2  Two Mixtures

In the case of m = 2 mixtures the observed data x(k) lie in the two-dimensional space R². Although it is possible to treat each mixture separately as a single-mixture, multiple-sources problem, like the one treated in the previous subsection, this is not the most efficient approach. It turns out that the 2D structure of the output constellation reveals the mixing operator H in a very elegant and straightforward way. To see this, let us start by considering the data constellation of a binary antipodal signal s1(k) (fig. 12.3a). The constellation consists of two points on the real axis: s− = −1 and s+ = 1. Next, consider a linear transformation from R¹ to R² which maps s1(k) to a vector signal x^(1)(k) = [x1^(1)(k), x2^(1)(k)]^T:

x^(1)(k) = h1 s1(k).                                             (12.27)

The linear operator h1 = [h11, h12]^T is a two-dimensional vector shown in fig. 12.3b. The constellation of x^(1)(k) is shown in fig. 12.3c, and it also consists of two points, x− = −h1 = s− h1 and x+ = h1 = s+ h1. Now let us look at the shape of the data cloud corresponding to the linear combination of several binary antipodal sources s1(k), . . . , sn(k). It is instructive to study the shape of this cloud as n increases gradually from n = 2 upward. The linear mixture of n = 2 sources,

x^(2)(k) = h1 s1(k) + h2 s2(k),                                  (12.28)

has the geometric structure shown in fig. 12.4b, for the mixing vectors h1, h2 shown in fig. 12.4a. The data cluster contains four points: x++ = s+h1 + s+h2, x+− = s+h1 + s−h2, x−+ = s−h1 + s+h2, and x−− = s−h1 + s−h2. Adding a third source s3(k) with the mixing vector h3, the data mixture

x^(3)(k) = h1 s1(k) + h2 s2(k) + h3 s3(k)                        (12.29)

has the constellation shown in ﬁg. 12.5. Now the data cluster contains eight points: x+++ = s+ h1 + s+ h2 + s+ h3 , x++− = s+ h1 + s+ h2 + s− h3 , x+−+ = s+ h1 + s− h2 + s+ h3 , x+−− = s+ h1 + s− h2 + s− h3 , x−++ = s− h1 + s+ h2 + s+ h3 , x−+− = s− h1 + s+ h2 + s− h3 , x−−+ = s− h1 + s− h2 + s+ h3 , and x−−− = s− h1 + s− h2 + s− h3 . By simple inspection of ﬁgures 12.3, 12.4, and 12.5, one can make the following useful observations:

Figure 12.3  (a) Data constellation of a binary antipodal signal s1(k). (b) Linear transformation vector h1. (c) Data constellation of the transformed signal x^(1)(k) = h1 s1(k).

1. The number of cluster points is 2^n, where n is the number of binary sources.

2. The data constellation is a symmetric, self-repetitive figure. While the symmetry is obvious, the self-repetitive structure can be seen by comparing, for example, fig. 12.5b against fig. 12.4b. The first consists of two copies of the latter shifted by the vectors −h3 and h3. The same is true for figs. 12.4b and 12.3c, except that the shift is by the vectors −h2 and h2.

3. For every cluster point there exist n copies at the directions h1 or −h1, and h2 or −h2, . . . , and hn or −hn.

It is even more interesting and, in fact, very useful to study the properties of the convex hull of the data constellation set. By definition, the convex hull of a set of points in 2D space is the smallest polygon that contains them or, in other words, the bounding polygon for these points. Figures 12.6a–c show the convex hulls H1, H2, and H3 for the data constellations corresponding to the mixtures x^(1), x^(2), and x^(3), respectively. Let d be the distance between the two alphabet symbols.

Figure 12.4  (a) Mixing vectors h1, h2. (b) Data cluster for the mixture x^(2)(k) = h1 s1(k) + h2 s2(k) of two binary antipodal sources.

Figure 12.5  (a) Mixing vectors h1, h2, h3. (b) Data cluster for the mixture x^(3)(k) = h1 s1(k) + h2 s2(k) + h3 s3(k) of three binary antipodal sources.

It can be shown that any convex hull H satisfies the following properties (for the proof, see Diamantaras, 2002):

1. Every edge e of H is parallel to some mixing vector hi, i ∈ {1, · · · , n}. Also, e has length d‖hi‖. For the binary antipodal alphabet Aa we have d = 2.

2. Every vector hi corresponds to a pair of edges, i.e., it is parallel to two edges ei and e′i of equal length d‖hi‖. It follows that H has 2n edges.

3. H is symmetric. If the alphabet is symmetric around 0 (e.g., Aa), then the center of symmetry is the point xO = 0. Otherwise, the center of symmetry is a nonzero point xO ∈ R^m.

Figure 12.6  Convex hulls for data constellations of mixtures of n binary sources. (a) n = 1, (b) n = 2, (c) n = 3.


These important results show that there is a two-to-one correspondence between the edges of the convex hull and the unknown mixing vectors: there is a pair of edges parallel to each mixing vector and, furthermore, the edges have length equal to d times the length of their corresponding mixing vectors. Thus we easily arrive at the following procedure for identifying the hi's:

Algorithm 12.2: n binary sources, two observations
Step 1: Find the constellation set X of the 2D mixture x(k).
Step 2: Compute the convex hull H of X.
Step 3: H consists of n edge pairs {ei, e′i}, ei ∥ e′i, i = 1, · · · , n. The number of sources is n.
Step 4: The mixing vectors are hi = ei/d, up to an unknown ordering and sign.

Of course, the original order and sign of the vectors are irretrievable. As we have seen, this is a general, problem-inherent limitation and it is not specific to this (or any other) particular method. In fact, the limitation cannot be overcome without additional information regarding the sources or the mixing operators.
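Algorithm 12.2 can be sketched as follows, assuming a noise-free constellation. The hull is computed with Andrew's monotone chain (our implementation choice; the text does not prescribe a hull algorithm), and each hull edge, divided by d = 2 and normalized to a canonical sign, yields one mixing vector. The three mixing vectors below are hypothetical.

```python
from itertools import product

def convex_hull(points):
    """Andrew's monotone-chain convex hull; vertices in counterclockwise order."""
    pts = sorted(set(points))
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 1e-12:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 1e-12:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def algorithm_12_2(observations, d=2.0):
    """Sketch of algorithm 12.2: recover the mixing vectors, up to order and
    sign, from the convex hull edges of a 2-D binary-source constellation."""
    hull = convex_hull(observations)
    vecs = set()
    for i in range(len(hull)):
        ex = hull[(i + 1) % len(hull)][0] - hull[i][0]
        ey = hull[(i + 1) % len(hull)][1] - hull[i][1]
        hx, hy = ex / d, ey / d              # step 4: h_i = e_i / d
        if hx < 0 or (hx == 0 and hy < 0):   # canonical sign (ambiguity)
            hx, hy = -hx, -hy
        vecs.add((round(hx, 6), round(hy, 6)))
    return vecs

# three hypothetical mixing vectors; the constellation has 2^3 = 8 points
H = [(1.0, 0.2), (-0.4, 0.9), (0.3, -0.7)]
X = [tuple(round(sum(b * h[i] for b, h in zip(bits, H)), 10) for i in range(2))
     for bits in product((-1, 1), repeat=3)]
recovered = algorithm_12_2(X)
```

The hull of this constellation is a hexagon (2n = 6 edges), and the two edges of each parallel pair collapse onto the same canonical vector, so exactly n = 3 vectors are returned.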

12.2.1.3  More Than Two Mixtures

It is not difficult to see that the whole convex hull idea can be extended to the case where m ≥ 3. Again, the edges of the convex hull will be parallel to the mixing vectors hi, except that, now, the convex hull lies in R^m. The algorithms for computing the convex hull in m-dimensional spaces are not as simple as the ones for the 2D case. For a comprehensive discussion of this topic see Preparata and Shamos (1985).

12.2.2  Instantaneous Mixtures of M-ary Alphabet Sources

The results of section 12.2.1 can be easily extended to M-ary signals, i.e., signals whose alphabet contains M discrete, equally spaced values. For example, the alphabet A5 = {−1, −1/2, 0, 1/2, 1} contains M = 5 symbols symmetrically distributed around 0. Similar results as in the binary case hold here as well. Again, the convex hull directly connects the constellation geometry with the unknown mixing vectors. Let d be the distance between the maximum and minimum symbols in the M-ary alphabet AM:

d = max{AM} − min{AM}.

Also let H be the convex hull of the constellation X of the mixture x(k) = h1s1(k) + · · · + hnsn(k). Then the following statements are true (see fig. 12.7):

1. The number of cluster points is M^n, where n is the number of M-ary sources.

2. The data constellation is a symmetric, self-repetitive figure.


3. Every edge e of H is parallel to some mixing vector hi, i ∈ {1, · · · , n}, and e has length d‖hi‖.

4. Every vector hi corresponds to a pair of edges, i.e., it is parallel to two edges ei and e′i of equal length d‖hi‖. It follows that H has 2n edges.

5. H is symmetric. For alphabets symmetric around zero, the center of symmetry is xO = 0.

We may use algorithm 12.2 without modifications for the solution of the M-ary case as well.

12.2.3  Noisy Data

The analysis of the previous subsections pertains to systems with noiseless outputs. In most applications, however, the observation is burdened with noise, either because the system itself is noisy or because the receiving device introduces errors in the measurements. The additive noise model is commonly used for describing the observation error:

x(k) = Hs(k) + v(k).                                             (12.30)

Without loss of generality, and for the sake of visualization, we shall focus on the two-output case; an entirely similar discussion holds for m = 1 or m > 2. The vector signal v(k) = [v1(k), · · · , vm(k)]^T contains the noise components vi(k) for each observed output signal i = 1, · · · , m. The constellation of x is now less crisp, since the true centers are surrounded by a cloud of points (fig. 12.8a). The methods presented in sections 12.2.1 and 12.2.2 can still be applied, preceded by a clustering process that estimates the actual centers from the noisy data cloud. Such clustering methods include the ISODATA or K-means algorithm (Duda et al., 2001; Lloyd, 1982; MacQueen, 1967), the EM algorithm (Dempster et al., 1977), the neural gas algorithm (Martinetz et al., 1993), Kohonen's self-organizing feature maps (SOM) (Kohonen, 1989), RBF neural networks (Moody and Darken, 1989), and many others. For a detailed treatment of clustering methods refer to Theodoridis and Koutroubas (1998). Figure 12.8b shows the estimation of the true centers using the K-means algorithm in a system with three binary inputs and two linear output mixtures at a noise level of 15 dB. Notice that estimation errors inside the convex hull do not affect the results; only the errors at the boundary are significant. Applying the blind identification method discussed earlier in this section to the estimated centers provided by K-means, we obtain the results shown in table 12.1.
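The clustering step that precedes the geometric identification can be illustrated with a minimal, self-contained K-means sketch. Everything here is our own illustration (a 1-D mixture of two binary sources with h = [0.9, 0.4], quantile initialization for determinism), not the 15 dB two-output example of the text.

```python
import random

def kmeans_1d(data, k, iters=50):
    """Minimal 1-D K-means: assign each sample to its nearest center, then
    recompute each center as its cluster mean.  Quantile initialization on the
    sorted data keeps the run deterministic for well-separated clusters."""
    data = sorted(data)
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in data:
            j = min(range(k), key=lambda i: abs(v - centers[i]))
            buckets[j].append(v)
        centers = [sum(b) / len(b) if b else centers[i]
                   for i, b in enumerate(buckets)]
    return sorted(centers)

# hypothetical noisy single mixture of two binary sources
random.seed(0)
h1, h2 = 0.9, 0.4
true_centers = sorted(a * h1 + b * h2 for a in (-1, 1) for b in (-1, 1))
x = [random.choice(true_centers) + random.gauss(0, 0.05) for _ in range(400)]
est = kmeans_1d(x, 4)
```

With the cluster centers estimated, equations 12.13 and 12.14 (or algorithm 12.1 for more sources) apply exactly as in the noise-free case.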

12.2.4  Convolutive Mixtures of Binary Sources

The convolutive mixtures of binary sources are described by the output of the MIMO FIR system (eq. 12.1). The blind problems related to such systems are considerably more difficult than the corresponding instantaneous mixture problems but, at the same time, they are much more important. Convolutive mixing models, for example, can describe multipath and crosstalk phenomena in wireless communications, being in that sense much more realistic than instantaneous models. In this section we shall approach the blind source separation and blind system identification problems of MIMO FIR models using the geometric properties of the data constellation. We shall treat, first, the simpler single-input single-output (SISO) problem, and then continue on to the multi-input single-output (MISO) case. The proper MIMO problem is not explicitly discussed since it can be seen as a multitude of m decoupled MISO problems.

Figure 12.7  Convex hulls of mixture constellations from n M-ary sources (M = 5). The source symbols are drawn from the alphabet {−1, −0.5, 0, 0.5, 1}, with maximum distance d = 2. (a) n = 1, (b) n = 2, (c) n = 3.

Figure 12.8  (a) Data constellation for a noisy memoryless linear system with 3 binary inputs and 2 outputs (mixtures). The noise level is 15 dB. Superimposed are the true cluster centers, marked with "o". (b) True cluster centers (o) and estimated cluster centers (x) using the K-means algorithm. Also shown are the true convex hull (solid line) and the estimated convex hull (dashed line).

Table 12.1  True and estimated mixing vectors

            h1        ĥ1        h2        ĥ2        h3        ĥ3
          0.3000    0.3017   −0.1000   −0.0960   −0.4000    0.3917
          0.5000    0.4902    0.6000    0.6089   −0.1000    0.1009

12.2.4.1  Blind SISO Deconvolution as Instantaneous Blind Source Separation

In this subsection we shall use the results of the previous sections to solve the blind SISO identification and deconvolution problems. Our approach is to relate any given SISO system to an overdetermined instantaneous mixtures model, so that the same methods can be applied as in section 12.2.1. Let us consider a linear FIR single-input single-output (SISO) system with a binary antipodal input s(k):

x(k) = Σ_{i=0}^{L−1} h_i s(k − i).                               (12.31)

We shall assume that the impulse response hi, i = 0, · · · , L − 1, is real. Let us create a vector sequence x(k) using time-windowing of length m on the output sequence x(k):

x(k) = [x(k), · · · , x(k − m + 1)]^T.                           (12.32)

Then, using the system 12.31, we have

x(k) = H s(k),                                                   (12.33)

where H is the m × (m + L − 1) Toeplitz system matrix

H = [ h0   h1   · · ·  hL−1   0     · · ·   0
      0    h0   · · ·  hL−2   hL−1  · · ·   0
      ⋮           ⋱                         ⋮
      0    0    · · ·  h0     · · ·  hL−2   hL−1 ],              (12.34)

and

s(k) = [s(k), s(k − 1), · · · , s(k − m − L + 2)]^T.             (12.35)

Now, equation 12.33 describes m linear instantaneous mixtures xi(k) = x(k − i), i = 0, · · · , m − 1, of n sources sj(k) defined as follows:

sj(k) = s(k − j + 1),   j = 1, 2, · · · , n = m + L − 1.

Thus, we have successfully transformed the problem into the same form treated in section 12.2.1:

xi(k) = Σ_{j=1}^{n} h_{ij} s_j(k),                               (12.36)

where hij is the (i, j)th element of H. Equivalently, we can write

x(k) = Σ_{j=1}^{n} h_j s_j(k),                                   (12.37)

where the mixing vectors h1 , . . . , hn are the columns of H. Given the above formulation, the results of Section 12.2.1 apply directly to this problem. There are, however, some special points to be noted: 1. For any nontrivial FIR ﬁlter of length L > 1, the number of observations x1 , ..., xm is necessarily less than the number of sources s1 , . . . , sn , since n = m+L−1 > m.


2. The mixing vectors have no arbitrary form. For example, h1 has the form [×, 0, · · · , 0]^T and hn has the form [0, · · · , 0, ×]^T.

3. The sources are not independent. In fact, any one is a shifted version of any other.

Next, we shall give examples for the two cases m = 1 and m = 2.

Example 12.3  Time Window of Length m = 1. Suppose that we observe the output x(k) of a SISO filter h = [−0.4937, −1.1330, 0.7632, 0.1604]^T excited by the binary input s(k). Using algorithm 12.1 we shall identify the filter with the necessary permutation and sign changes, so that the estimated taps will be positive and arranged in decreasing order. Thus we shall obtain ĥ = [1.1330, 0.7632, 0.4937, 0.1604]^T and so

x(k) = ĥ1 ŝ1(k) + ĥ2 ŝ2(k) + ĥ3 ŝ3(k) + ĥ4 ŝ4(k)
     = (−h2)(−s2(k)) + h3 s3(k) + (−h1)(−s1(k)) + h4 s4(k).

Obviously, the estimated sources ŝi correspond to the true "sources" si as follows: ŝ1(k) = −s2(k) = −s(k − 1), ŝ2(k) = s3(k) = s(k − 2), ŝ3(k) = −s1(k) = −s(k), and ŝ4(k) = s4(k) = s(k − 3). Since the signals ŝi are shifted versions of the original source s(k), it is easy to recover their correct order and relative sign changes by computing, for each signal, the time shift with maximum correlation to an arbitrary reference, for example ŝ1. Applying the same ordering and sign changes to ĥ, we obtain ±h.

Example 12.4  Time Window of Length m = 2. Consider the same SISO filter as before, and let us use time-windowing of length m = 2 to obtain the vector sequence x(k):

x(k) = [x(k), x(k − 1)]^T
     = [ −0.4937  −1.1330   0.7632   0.1604    0
             0    −0.4937  −1.1330   0.7632   0.1604 ] [s(k), s(k − 1), s(k − 2), s(k − 3), s(k − 4)]^T.

Using algorithm 12.2 we estimate the original mixing vectors h1 = [−0.4937, 0]^T, h2 = [−1.1330, −0.4937]^T, h3 = [0.7632, −1.1330]^T, h4 = [0.1604, 0.7632]^T, h5 = [0, 0.1604]^T, but with an arbitrary order and sign change.
The estimated mixing vectors can be put in the correct order by observing that the true system parameters satisfy the following:

h1,1 = h2,2 = −0.4937,
h2,1 = h3,2 = −1.1330,
h3,1 = h4,2 = 0.7632,
h4,1 = h5,2 = 0.1604,
h5,1 = h1,2 = 0.

Since the sign of each estimated vector is arbitrary, we compare the absolute values |ĥi,1| against |ĥj,2|, and change the signs of either ĥi or ĥj, as necessary, so that ĥi,1 = ĥj,2. Once the correct order of the mixing vectors is retrieved, we automatically obtain the correct filter impulse response (up to a sign). Subsequently, the system input s(k) is retrieved using standard (nonblind) deconvolution methods.
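The windowing construction of equations 12.32–12.34 can be sketched directly. The filter below is the one of examples 12.3 and 12.4, while the binary input string, the chosen time instant, and the helper names are our own illustrative assumptions.

```python
def toeplitz_mixing(h, m):
    """Build the m x (m+L-1) Toeplitz system matrix H of equation 12.34."""
    L = len(h)
    n = m + L - 1
    return [[h[j - i] if 0 <= j - i < L else 0.0 for j in range(n)]
            for i in range(m)]

def window(x, k, m):
    """Windowed output vector x(k) = [x(k), ..., x(k-m+1)]^T (eq. 12.32)."""
    return [x[k - i] for i in range(m)]

# the SISO filter of examples 12.3 and 12.4, windowed with m = 2
h = [-0.4937, -1.1330, 0.7632, 0.1604]
s = [1, -1, -1, 1, 1, -1, 1, 1, -1, -1]          # hypothetical binary input
x = [sum(h[i] * s[k - i] for i in range(len(h)) if k - i >= 0)
     for k in range(len(s))]                      # FIR output, eq. 12.31
H = toeplitz_mixing(h, m=2)

# check eq. 12.33 at one (sufficiently late) time instant: x(k) = H s(k)
k = 6
sk = [s[k - j] for j in range(len(H[0]))]         # [s(k), ..., s(k-m-L+2)]
xk = window(x, k, 2)
pred = [sum(row[j] * sk[j] for j in range(len(sk))) for row in H]
```

For m = 2 and L = 4, H is the 2 × 5 matrix written out in example 12.4, and its columns are the five mixing vectors h1, . . . , h5 that algorithm 12.2 estimates.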

Blind SISO Identiﬁcation

An alternative approach for identifying the impulse response h = [h0, ···, h_{L−1}]^T of a general SISO system (eq. 12.31) has been proposed by Yellin and Porat (1993). The method is not based on constellation geometry but rather on the properties of the successor values of "equivalent" observations. The source symbols s(k) may be drawn from an M-ary alphabet A_M = {±1, ···, ±(M/2)} (M is even). Before we proceed we need to introduce the concept of equivalence between two observations:

Definition 12.1 Observation Equivalence
Two observations x(k) and x(l) are said to be equivalent if the input values that produce them according to 12.31 are identical: s(k−i) = s(l−i), for all i = 0, ···, L−1.

Note that two equivalent observations are necessarily equal, but the converse may not be true. Indeed, it is possible that two equal observations x(k) = x(l) are produced by two different strings of input symbols [s(k), ···, s(k−L+1)] ≠ [s(l), ···, s(l−L+1)].

Consider four sets of (N+1) consecutive observations from equation 12.31: X_j = {x(j), x(j+1), ···, x(j+N)}, X_k = {x(k), x(k+1), ···, x(k+N)}, X_l = {x(l), x(l+1), ···, x(l+N)}, X_m = {x(m), x(m+1), ···, x(m+N)}. Further assume that the pairs {x(j), x(k)} and {x(l), x(m)} are equivalent. Define

σjki = [s(j+i) − s(k+i)]/2,   σlmi = [s(l+i) − s(m+i)]/2,   i = 1, ···, N,   (12.38)

and note that σjki, σlmi ∈ A^0_M = A_M ∪ {0}. Let the following conditions be true:

12.2

Finite Alphabet Sources

359

1. σjk1, σlm1 are nonzero and coprime, i.e., their greatest common divisor is 1;

2. for all α, β ∈ A_M,

   α/β = σjk1/σlm1  ⇒  |α| = |σjk1|, |β| = |σlm1|;

3. for all α, β ∈ A^0_M,

   σjk1/σlm1 ≠ (σjki − α)/(σlmi − β),   for all i = 2, ···, N.

The method starts by identifying the first filter tap h0 up to a sign, and continues by recursively identifying the remaining taps given the previous ones. Begin with the remark that x(j) and x(k) are equivalent, so [s(j), ···, s(j−L+1)] = [s(k), ···, s(k−L+1)]. Then the successor values of x(j), x(k) can be written as

x(j+1) = h0 s(j+1) + Σ_{i=1}^{L−1} hi s(j+1−i),
x(k+1) = h0 s(k+1) + Σ_{i=1}^{L−1} hi s(k+1−i),

so

[x(j+1) − x(k+1)]/2 = σjk1 h0.   (12.39)

Similarly, for x(l+1), x(m+1):

[x(l+1) − x(m+1)]/2 = σlm1 h0,   (12.40)

and so

[x(j+1) − x(k+1)] / [x(l+1) − x(m+1)] = σjk1/σlm1.   (12.41)

By condition 2, the ratio |σjk1/σlm1| is produced by a unique numerator-denominator pair in A_M. Thus both values σjk1 and σlm1 can be uniquely identified, up to a sign, leading to the magnitude estimate of h0:

|h0| = |x(j+1) − x(k+1)| / (2|σjk1|) = |x(l+1) − x(m+1)| / (2|σlm1|).   (12.42)

Without loss of generality, we may assume that h0 > 0, and proceed to the estimation of h1 as follows. Write the second successors of x(j), x(k) as

x(j+2) = h0 s(j+2) + h1 s(j+1) + Σ_{i=2}^{L−1} hi s(j+2−i),
x(k+2) = h0 s(k+2) + h1 s(k+1) + Σ_{i=2}^{L−1} hi s(k+2−i),

hence

[x(j+2) − x(k+2)]/2 = σjk2 h0 + σjk1 h1.   (12.43)

Similarly,

[x(l+2) − x(m+2)]/2 = σlm2 h0 + σlm1 h1.   (12.44)

The pair of equations 12.43 and 12.44 involves three unknowns: σjk2, σlm2, h1. However, it turns out that since the first two unknowns come from the discrete set A^0_M and condition 3 holds, the solution is unique. Indeed, assume there existed two different solutions {σjk2^(1), σlm2^(1), h1^(1)} and {σjk2^(2), σlm2^(2), h1^(2)}. Then by equations 12.43 and 12.44 we have

(σjk2^(2) − σjk2^(1)) h0 = (h1^(1) − h1^(2)) σjk1,   (12.45)
(σlm2^(2) − σlm2^(1)) h0 = (h1^(1) − h1^(2)) σlm1.   (12.46)

Thus,

(σjk2^(2) − σjk2^(1)) / (σlm2^(2) − σlm2^(1)) = σjk1/σlm1,

which is impossible, according to condition 3. Therefore, there exists a unique solution to equations 12.43 and 12.44. From these equations it follows that

h1 = ([x(j+2) − x(k+2)]/2 − σjk2 h0) / σjk1 = ([x(l+2) − x(m+2)]/2 − σlm2 h0) / σlm1,

so the unique h1 can be obtained by finding the intersection of the sets

F1 = { [x(j+2) − x(k+2)]/(2σjk1) + α h0/σjk1 ; α ∈ A^0_M },
F2 = { [x(l+2) − x(m+2)]/(2σlm1) + β h0/σlm1 ; β ∈ A^0_M }.

This is computationally trivial since the two sets are finite with few elements. Inductively, for hi, i ≥ 2, given the values of h0, ..., h_{i−1}, we form the


"deflated" successors

x̄(j+i+1) = x(j+i+1) − Σ_{p=1}^{i−1} σjk(p+1) h_{i−p},   (12.47)
x̄(k+i+1) = x(k+i+1) + Σ_{p=1}^{i−1} σjk(p+1) h_{i−p},   (12.48)
x̄(l+i+1) = x(l+i+1) − Σ_{p=1}^{i−1} σlm(p+1) h_{i−p},   (12.49)
x̄(m+i+1) = x(m+i+1) + Σ_{p=1}^{i−1} σlm(p+1) h_{i−p},   (12.50)

and we obtain a set of two equations similar to 12.43 and 12.44:

[x̄(j+i+1) − x̄(k+i+1)]/2 = σjk(i+1) h0 + σjk1 hi,   (12.51)
[x̄(l+i+1) − x̄(m+i+1)]/2 = σlm(i+1) h0 + σlm1 hi,   (12.52)

which are solved in the same fashion, producing the unknown tap hi. The whole approach is summarized in the following algorithm.

Algorithm 12.3 Yellin and Porat
Step 1: Collect T observation measurements.
Step 2: Find pairs of equivalent measurements. Estimate h0 according to equation 12.42.
Step 3: Estimate h1 using h0 and the pairs of equivalent observations.
Step 4: Continue with the estimation of h2, ..., h_{L−1} given the previous estimates.
Step 5: Use the estimated impulse response to deconvolve the observation sequence and obtain the system input.

Remark
The choice of pairs of equivalent observations (step 2 in algorithm 12.3) is far from trivial. The indices j, k, l, m must satisfy various constraints so that the assumptions of the method are met. First, according to condition 1, we must have σjk1, σlm1 ≠ 0, implying that x(j+1) ≠ x(k+1) and x(l+1) ≠ x(m+1). Second, according to condition 3, for all i = 2, ···, N the ratios (σjki − α)/(σlmi − β), α, β ∈ A^0_M, must not equal σjk1/σlm1. A thorough discussion of the implementation details is given in the original paper (Yellin and Porat, 1993). The method is easily extended to handle complex input constellations (such as QAM) and/or complex filter taps. For the special case of i.i.d. input signals, it is estimated that a sufficient batch size that guarantees E > 2 equivalent pairs of measurements is T = 2.44 E^{0.61} N M^{N/2}.
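Equation 12.39 is easy to verify numerically. The sketch below (a toy binary channel with hypothetical taps, not a full implementation of algorithm 12.3) finds one equivalent pair by brute force and recovers h0 exactly:

```python
import numpy as np

# Sanity check of eq. 12.39 on a toy binary SISO channel
# (the filter taps below are hypothetical, not from the text).
rng = np.random.default_rng(0)
h = np.array([0.9, -0.5, 0.3])               # h0, h1, h2  (L = 3)
L = len(h)
s = rng.choice([-1.0, 1.0], size=400)        # binary input
x = np.convolve(s, h)[:len(s)]

# Brute-force search for an equivalent pair with sigma_jk1 != 0:
# identical length-L input windows at j and k, but s(j+1) != s(k+1).
j, k = next((j, k)
            for j in range(L - 1, 300)
            for k in range(j + 1, 300)
            if np.array_equal(s[j - L + 1:j + 1], s[k - L + 1:k + 1])
            and s[j + 1] != s[k + 1])

sigma_jk1 = (s[j + 1] - s[k + 1]) / 2        # in {-1, +1} for binary input
h0_est = (x[j + 1] - x[k + 1]) / (2 * sigma_jk1)   # eq. 12.39
print(abs(h0_est - h[0]) < 1e-12)
```

For binary input the sign ambiguity collapses (σjk1 ∈ {−1, +1}), which is why the toy check recovers h0 itself rather than only |h0|.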


It is difficult to satisfy condition 3 if the source alphabet is binary (A_M = A_a), because there is limited choice for the values of σjki, σlmi, which belong to the set A^0_a = {−1, 0, 1}.

12.2.4.3 MISO Systems: Direct Source Extraction

The blind extraction of sources directly from the output of a multi-input single-output (MISO) system is treated in Diamantaras and Papadimitriou (2005). This work is an extension of earlier work on SISO systems (Diamantaras and Papadimitriou, 2004a). The key to the approach in both cases is the structure of the successor values of equivalent observations, induced by the fact that the sources are binary. Subsequently, we shall present the results for the more general MISO case. Let us consider a MISO model described by the following equation:

x(k) = Σ_{i=0}^{L−1} hi^T s(k − i),   (12.53)

where hi, i = 0, ..., L−1, are a set of unknown real n-dimensional mixing vectors or filter taps. The source vector signal s(k) = [s1(k), ..., sn(k)]^T is composed of n independent binary antipodal signals: si(k) ∈ A_a. The observations of the mixtures are real-valued scalars. For each k, the vector s(k) can take one of 2^n values, denoted b_i^(n), i = 1, ..., 2^n. The vector b_i^(n)T is the ith row of the matrix B^(n) defined in equation 12.17. Let us extend the concept of observation equivalence, defined before for SISO systems, to MISO systems by simply replacing the scalar inputs with vector inputs. Each observation x(k) is generated by the linear combination of L n-dimensional source vectors; therefore, the observation space X ∋ x(k) is a discrete set consisting of at most 2^M elements, M = nL. The cardinality |X| will be less than 2^M if and only if there exist two different L-tuples {b_{j0}^(n), ···, b_{jL−1}^(n)} and {b_{l0}^(n), ···, b_{lL−1}^(n)} of binary vectors such that Σ_{i=0}^{L−1} hi^T b_{ji}^(n) = Σ_{i=0}^{L−1} hi^T b_{li}^(n). The following assumption avoids this situation:

Assumption 12.1
Two observations x(k), x(l) are equivalent if and only if they are equal.

Hence, |X| = 2^M. In other words, to every observation value r ∈ X corresponds a unique L-tuple {b̄_0(r), ···, b̄_{L−1}(r)} of consecutive source vectors that generates this observation; no other observation value r′ ∈ X corresponds to the same L-tuple of binary vectors. For any x(k) = r, we have

x(k) = Σ_{i=0}^{L−1} hi^T b̄_i(r),   (12.54)

since, by definition,

b̄_i(r) = s(k − i),   for i = 0, ···, L−1.


Now the successor observation x(k+1) can be written as

x(k+1) = h0^T s(k+1) + Σ_{i=1}^{L−1} hi^T s(k − (i−1))
       = h0^T s(k+1) + Σ_{i=1}^{L−1} hi^T b̄_{i−1}(r).   (12.55)

Since s(k+1) is an n-dimensional binary antipodal vector, x(k+1) can take one of the following 2^n possible values:

y_p(r) = h0^T b_p^(n) + Σ_{i=1}^{L−1} hi^T b̄_{i−1}(r),   p = 1, ···, 2^n.   (12.56)

Note that the successor values y_p(r) do not depend on the specific time index k but only on the observation value r. Therefore, each observation value r creates a class of successors Y(r) with cardinality |Y(r)| = 2^n. Furthermore, we have Σ_{p=1}^{2^n} b_p^(n) = 0, so the mean ȳ(r) of the members of Y(r) is

ȳ(r) = (1/2^n) Σ_{p=1}^{2^n} y_p(r)
     = (1/2^n) [ h0^T Σ_{p=1}^{2^n} b_p^(n) + 2^n Σ_{i=1}^{L−1} hi^T b̄_{i−1}(r) ]
     = Σ_{i=1}^{L−1} hi^T b̄_{i−1}(r).   (12.57)

Now let us replace every x(k) = r with the mean ȳ(r) to obtain a new sequence x^(2)(k):

x^(2)(k) = Σ_{i=1}^{L−1} hi^T b̄_{i−1}(r)
         = Σ_{i=1}^{L−1} hi^T s(k − i + 1).   (12.58)

The new MISO system 12.58 has the same taps as the original system 12.53 except that it is shorter, since h0 is missing. We will say that the system has been deflated. An additional but trivial difference is that the source sequence is time-shifted. Based on the discussion above, the whole filter- or system-deflation method is summarized as follows:

Algorithm 12.4 System Deflation
Step 1: For every r ∈ X locate the set of time instances K(r) = {k : x(k) = r}.
Step 2: Find the successor set Y(r) = {x(k+1) : k ∈ K(r)}. This set must contain 2^n distinct values y_1(r), ..., y_{2^n}(r).
Step 3: Compute the mean ȳ(r) = (1/2^n) Σ_{i=1}^{2^n} y_i(r).
Step 4: Replace x(k) by ȳ(r), for all k ∈ K(r).
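A minimal numpy sketch of algorithm 12.4, on a toy 2-input/1-output system with hypothetical taps (not from the text): one deflation pass removes h0 and leaves the shortened system of eq. 12.58, here an instantaneous mixture.

```python
import numpy as np

# Toy MISO system: n = 2 binary sources, L = 2 taps (hypothetical values).
rng = np.random.default_rng(1)
n, L = 2, 2
H = [np.array([0.9, -0.55]), np.array([0.32, 0.71])]   # h_0, h_1
s = rng.choice([-1.0, 1.0], size=(20000, n))
x = np.array([sum(H[i] @ s[k - i] for i in range(L))
              for k in range(L - 1, len(s))])

xr = np.round(x, 10)                         # cluster identical observations
x2 = np.empty(len(x) - 1)
for r in np.unique(xr[:-1]):
    K = np.nonzero(xr[:-1] == r)[0]          # K(r) = {k : x(k) = r}   (step 1)
    Y = np.unique(np.round(x[K + 1], 10))    # successor set Y(r)      (step 2)
    assert len(Y) == 2 ** n
    x2[K] = Y.mean()                         # replace x(k) by mean    (steps 3-4)

# After one deflation, x2(k) should equal h_1^T s(k): instantaneous mixture.
print(np.allclose(x2, s[1:1 + len(x2)] @ H[1]))
```

The rounding step stands in for the clustering of observation values; it relies on the toy taps producing well-separated clusters, as assumption 12.1 requires.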


Clearly, for this method it is essential that all observation-successor pairs [r, y_i(r)], i = 1, ···, 2^n, appear at least once in the output sequence x. Applying the deflation method L−1 times eventually reduces the system to a multi-input single-output instantaneous problem:

x^(L)(k) = h_{L−1}^T s(k − L + 1).   (12.59)

The BSS problem of the type in 12.59 has been treated in section 12.2.1. The main disadvantage of this method stems from the assumption that the data set must contain every possible observation-successor pair. As the size of the MISO system increases, this assumption requires exponentially larger observation data sets. An alternative approach starts by observing that for any r ∈ X the centered successors

c_i = y_i(r) − ȳ(r) = h0^T b_i^(n),   i = 1, ···, 2^n,   (12.60)

are independent of r. Thus every observation has the same set of centered successors. We shall refer to the set C = {c_i ; i = 1, ···, 2^n} as the centered successor constellation set of system 12.53. C can be easily computed by first obtaining Y(r), for any r, and then subtracting the mean ȳ(r) from each element y_i(r) ∈ Y(r). Note that C is symmetric in the sense that c ∈ C ⇔ −c ∈ C. Now, for every observation value r = x(k) ∈ X we have

x(k) = h0^T s(k) + Σ_{l=1}^{L−1} hl^T s(k − l),   (12.61)

r = h0^T b_i^(n) + Σ_{l=1}^{L−1} hl^T b̄_l(r),   for some i,   (12.62)

  = c_i + Σ_{l=1}^{L−1} hl^T b̄_l(r),   for some i.   (12.63)

Furthermore, due to the symmetry of the constellation set, there exists a "dual" observation value r^d ∈ X such that

r^d = −c_i + Σ_{l=1}^{L−1} hl^T b̄_l(r),   (12.64)

i.e.,

r^d = r − 2c_i.   (12.65)

Assume that for every observation r ∈ X there exists a unique index j ∈ {1, ···, 2^n} such that r − 2c_j ∈ X. Then the dual value r^d can be identified by testing all r − 2c_j, j = 1, ···, 2^n, for membership in the observation space X. Let us now replace x(k) by the average of r and r^d to obtain

x̃^(2)(k) = (r + r^d)/2 = r − c_j = Σ_{l=1}^{L−1} hl^T b̄_l(r).   (12.66)


Note that b̄_l(r) = s(k − l), so

x̃^(2)(k) = Σ_{l=1}^{L−1} hl^T s(k − l).   (12.67)

Equation 12.67 describes a new, shortened MISO system. The approach relies on the following assumption:

Assumption 12.2
For at least one r0 ∈ X, there exist 2^n time instances k_i ∈ {1, 2, ···, K}, i = 1, ···, 2^n, such that x(k_i) = r0 and x(k_i + 1) = σ_i(r0), i = 1, ···, 2^n. In addition, every possible value of X appears at least once in the data set.

Summarizing the above results, our second method for obtaining the deflated system 12.67 is described below:

Algorithm 12.5 System Deflation 2
Step 1: Locate an observation value r0 for which 2^n distinct successors σ_i(r0), i = 1, ···, 2^n, exist in the data set.
Step 2: Compute the successor constellation set C according to equation 12.60.
Step 3: For every observation r = x(k) find the (unique) value j for which r − 2c_j ∈ X.
Step 4: Replace x(k) by r − c_j.

Again, L−1 repetitions of this algorithm reduce the system to a memoryless one,

x̃^(L)(k) = h_{L−1}^T s(k),   (12.68)

which can be treated as described in the previous section on MIMO systems. Example 12.5 MISO System Identiﬁcation and Source Separation We shall demonstrate the application of the second method via a speciﬁc example. Assume that we observe the output x(k) of a two-input one-output system (ﬁg. 12.9a). The system has two binary inputs s1 , s2 , convolution length L = 3, and ﬁlter taps h1 = [−0.9024, 1.5464]T , h2 = [−0.6131, 0.7166]T , h3 = [−0.4131, −0.1621]T . The output constellation contains 2nL = 64 clusters: X = {±4.3537, ±4.0295, ±3.5275, · · · }. Already, the ﬁrst value r = −4.3537 has 2n = 4 distinct successors in the output sequence x(k). From those successor values the centered successor constellation set is easily computed to be C = {−2.4488, −0.6440, 0.6440, 2.4488}. After the deﬂation steps 2 and 3 we obtain a new sequence x(2) (k) (ﬁg. 12.9b). Now the output constellation contains 2n(L−1) = 16 clusters: X (2) = {±1.9049, ±1.5807, ±1.0787, · · · } and the centered successor constellation set is C (2) = {−1.3297, −0.1035, 0.1035, 1.3297}. We use this set to obtain a second deﬂated signal x(3) (k) (ﬁg. 12.9c). This signal actually corresponds to an instantaneous mixture of the two sources. The output


constellation has only four clusters: X^(3) = {−0.5752, −0.2510, 0.2510, 0.5752}. We may apply algorithm 12.1 to obtain an estimate of the mixing parameters and of the input signals as well. We obtain ĥ_{3,1} = 0.4131 = −h_{3,1} and ĥ_{3,2} = 0.1621 = −h_{3,2}. Subsequently, performing the optimization (eq. 12.15) for the estimation of the sources, we get perfect reconstruction (except for the sign): ŝ1(k) = −s1(k), ŝ2(k) = −s2(k).

Figure 12.9 (Top) Output signal from a two-input-one-output FIR system of length L = 3. The output constellation contains 2^{nL} = 64 distinct clusters. (Middle) First deflated signal with 2^{n(L−1)} = 16 clusters. (Bottom) Second deflated signal with 2^{n(L−2)} = 4 clusters. The last signal corresponds to an instantaneous mixture of the two sources.
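Example 12.5 can be reproduced in a few lines. The sketch below (a plain implementation of algorithm 12.5 on the example's taps; the random input and the rounding-based clustering are our own choices) deflates the output twice and recovers the four-cluster constellation quoted in the example:

```python
import numpy as np

# Algorithm 12.5 ("System Deflation 2") on the system of example 12.5.
rng = np.random.default_rng(2)
n, L = 2, 3
H = [np.array([-0.9024, 1.5464]),
     np.array([-0.6131, 0.7166]),
     np.array([-0.4131, -0.1621])]
s = rng.choice([-1.0, 1.0], size=(60000, n))
x = np.array([sum(H[i] @ s[k - i] for i in range(L))
              for k in range(L - 1, len(s))])

def deflate(x, n):
    xr = np.round(x, 8)
    Xset = set(np.unique(xr))                    # observation space X
    r0 = xr[0]                                   # any observation value (step 1)
    K = np.nonzero(xr[:-1] == r0)[0]
    C = np.unique(np.round(x[K + 1], 8))         # 2**n successors of r0
    assert len(C) == 2 ** n
    C = C - C.mean()                             # centered constellation (step 2)
    out = np.empty_like(x)
    for t, r in enumerate(xr):
        dual = [c for c in C if round(r - 2 * c, 8) in Xset]  # step 3
        assert len(dual) == 1
        out[t] = r - dual[0]                     # x(k) -> r - c_j      (step 4)
    return out

x3 = deflate(deflate(x, n), n)                   # two deflations
print(np.unique(np.round(np.abs(x3), 4)))        # clusters of x^(3)
```

The final constellation magnitudes should match the text's X^(3) = {±0.2510, ±0.5752}, i.e., the entries ±0.4131±0.1621 of the last tap h3.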

12.3 Continuous Sources

In section 12.2 we exploited the constellation structure of signals generated by linear systems with finite-alphabet inputs. In many applications, however, the range of values of the source data is continuous. In this case the geometric properties of the signals can still be exploited to derive efficient deterministic blind separation methods, provided that the sources are sparse, the input distribution is bounded, or the number of observations is m = 2.

12.3.1 Early Approaches: Two Mixtures, Two Sources

It is possible to generalize the geometric properties of binary signals described in section 12.2.1 to the case where the source symbols are bounded. We start with the simplest case of two instantaneous mixtures x1, x2 and two sources s1, s2 (m = n = 2):

x(k) = [x1(k), x2(k)]^T = h1 s1(k) + h2 s2(k).   (12.69)

We shall describe two of the earliest and most characteristic methods, by Puntonet et al. (1995) and Mansour et al. (2001).

The Method of Puntonet et al. (1995)
The geometry of mixtures of binary signals bears similarity to the geometry of mixtures of bounded sources. Consider the mixing model 12.69 and let s1(k), s2(k) ∈ [−B, B]. The linear operation of equation 12.69 transforms the original square source constellation (fig. 12.10a) into a parallelogram-shaped constellation with edges parallel to the vectors h1 and h2 (fig. 12.10b). The blind identification task is then equivalent to finding the edges of the convex hull of the output constellation. Puntonet et al. (1995) proposed a simple two-step procedure for doing that:

Step 1: Locate the outmost corner x_O of the parallelogram by finding the observation with the maximum norm: x_O = x(k0), k0 = arg max_k {||x(k)||^2}.
Step 2: Translate the observations, x′(k) = x(k) − x(k0), so that x_O becomes the origin, and compute the slopes of the parallelogram edges from the minimum and maximum ratios r_min = min_k (x′2(k)/x′1(k)), r_max = max_k (x′2(k)/x′1(k)). These are the ratios h12/h11 and h22/h21, not necessarily in that order.

Once the slopes of the edges are determined, the mixing matrix is estimated by

Ĥ = [ 1      1/r_min ]
    [ r_max  1       ].   (12.70)

Since (r_min, r_max) equals (h12/h11, h22/h21) or (h22/h21, h12/h11), we have

Ĥ = [ 1        h21/h22 ]   or   [ 1        h11/h12 ]
    [ h12/h11  1       ]        [ h22/h21  1       ].

Remember now that

H = [h1, h2] = [ h11  h21 ]
               [ h12  h22 ],   (12.71)


Figure 12.10 (a) Source constellation for two independent sources uniformly distributed between −1 and 1. (b) Output constellation after a 2×2 linear, memoryless transformation of the sources in panel a.

so,

Ĥ = H [ 1/h11  0     ]   or   Ĥ = H [ 0      1/h12 ]
      [ 0      1/h22 ]              [ 1/h21  0     ].

In either case, the source estimate ŝ(k) = Ĥ^{−1} x(k) will be

ŝ(k) = [h11 s1(k), h22 s2(k)]^T   or   ŝ(k) = [h21 s2(k), h12 s1(k)]^T.   (12.72)

Thus, the estimated sources will be equal to the true ones except for the usual unspecified scale and order. Note that the method works even if the source pdf is only semibounded, for example, bounded only from below; in that case the parallelogram is open-ended but the visible corner is sufficient for identifying the two slopes. The main drawbacks of this approach are two: it does not generalize to more sources or observations, and it fails if the source pdf is unbounded (for example, Gaussian or Laplacian).

The Method of Mansour et al. (2001)
Another simple procedure for the solution of the 2 × 2 instantaneous BSS problem has been proposed by Mansour et al. (2001). The transformation s → x described by equation 12.69 represents a skew, rotation, and scaling of the original axes in two dimensions. The first step of the procedure is to remove the skew by prewhitening x using the covariance matrix R_x = E{x(k)x(k)^T}. If R_x = L_x L_x^T is the Cholesky factorization of R_x, let

z(k) = L_x^{−1} x(k).   (12.73)
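The prewhitening step of eq. 12.73 can be sketched as follows (hypothetical mixing matrix; the sample covariance stands in for the expectation):

```python
import numpy as np

# Prewhitening via Cholesky factorization of the sample covariance.
rng = np.random.default_rng(4)
H = np.array([[0.8, 0.3], [-0.2, 0.9]])       # hypothetical mixing matrix
s = rng.uniform(-1, 1, size=(2, 100000))
x = H @ s

Rx = x @ x.T / x.shape[1]                     # sample covariance R_x
Lx = np.linalg.cholesky(Rx)                   # R_x = L_x L_x^T
z = np.linalg.solve(Lx, x)                    # z(k) = L_x^{-1} x(k)

Rz = z @ z.T / z.shape[1]
print(np.round(Rz, 2))                        # ~ identity: z is white
```

Because the same sample covariance is used for both the factorization and the check, R_z comes out as the identity up to floating-point error.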


The mapping x → z is called a prewhitening transformation because the output vector z(k) is white: R_z = E{z(k)z(k)^T} = L_x^{−1} R_x L_x^{−T} = I. The prewhitening transformation (eq. 12.73) makes the axes orthogonal again, but the rotation and the scaling remain. The next step is to compensate for the rotation by computing the angle θ of the point of the constellation of z furthest from the origin. We consider two cases:

The sources are uniformly distributed, say, between −1 and 1 (fig. 12.11a). The source constellation is a square and the angle θ corresponds to a corner of the square. Therefore, in order to compensate for θ, the corner should return to its original position at π/4. This is achieved by the following orthogonal transformation:

y(k) = [ cos(π/4 − θ)  −sin(π/4 − θ) ]
       [ sin(π/4 − θ)   cos(π/4 − θ) ] z(k).   (12.74)

The sources are super-Gaussian, i.e., kurt(si) = E[si^4] − 3(E[si^2])^2 > 0, i = 1, 2 (fig. 12.11b). The constellation of s in this case is "pointy" along the directions [±1, 0] and [0, ±1]. The angle θ corresponds to one of the "hands" of the X-shaped constellation of x. Clearly, θ should be reduced to 0. This is done by the following rotation:

y(k) = [ cos(−θ)  −sin(−θ) ]
       [ sin(−θ)   cos(−θ) ] z(k).   (12.75)

In both cases there remains an unknown scaling of the sources, which cannot be removed since it is unobservable in all BSS problems.

12.3.2 Sparse Sources, Two Mixtures

Another special case of continuous sources that can be successfully treated using geometric methods is the case of sparse sources. A signal si (k) is sparse if it is equal to zero most of the time. The sparseness of the si is measured by the sparseness probability pS (si ) = Pr{si (k) = 0}. Values of pS closer to 1 correspond to more sparse data, whereas values closer to 0 represent dense data. Consider now the typical instantaneous mixing model: x(k) = Hs(k),

(12.76)

assuming that all the sources are sparse. Then it is highly likely that there exist some time instances such that only one source is active at that instance. If, for example, only si is nonzero at time k, then x(k) is proportional to hi , the ith column of H. The number of outputs m is not important, as long as m ≥ 2. In fact, the number of outputs may even be less than the number of sources (m < n). In the subsequent discussion we shall use the convenient value m = 2 because it will


Figure 12.11 The linear, instantaneous transformation s → x introduces skew, rotation, and scaling on the original axes. The whitening transform x → z removes the skew, making the axes orthogonal again. Then the rotation can be removed by an orthogonal transformation z → y. (a) If the source distribution is uniform we must rotate so that θ becomes π/4. (b) If the source distribution is super-Gaussian then we must rotate so that θ becomes 0.


Figure 12.12 Output constellation for m = 2 outputs and n = 4 sparse sources. The top three plots correspond to different sparseness probabilities: (a) p_S = 0.6, (b) p_S = 0.7, (c) p_S = 0.8. The solid lines are the directions of the four vector-columns of H. The three bottom plots (d), (e), and (f) are polar plots of the data density (potential) function with spreading parameter σ = 8, corresponding to the constellations in (a), (b), and (c), respectively.

help us visualize the results. Boﬁll and Zibulevsky (2001) observed that the data are clustered along the directions of the mixing vectors hi , i.e., the columns of H. Figure 12.12 shows the output constellation for the memoryless system (eq. 12.76) with m = 2 outputs, n = 4 sparse inputs, and diﬀerent sparseness levels. As the sparseness of the inputs increases, the four clustering directions become more easily identiﬁable (see ﬁgures 12.12a,b,c). Thus blind system identiﬁcation is achieved by identifying the directions of maximum data density. Assuming that the sources are zero mean, so they can take both positive and negative values, the clustering will extend to the negative directions −hi as well. Since, for each i, both opposing directions hi and −hi are equally probable, it is not possible to identify the “true” vector. This is a manifestation of the sign ambiguity which is inherent to the BSS problem. Not surprisingly, the ordering ambiguity is also present in the sense that there is no predeﬁned order on the directions of maximum data density.
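The clustering of sparse mixtures along the mixing directions is easy to demonstrate. The sketch below uses hypothetical directions and a plain angle histogram in place of the weighted kernel estimator of eq. 12.78:

```python
import numpy as np

# Locate mixing directions of sparse sources from the angle density.
rng = np.random.default_rng(5)
true_deg = np.array([20.3, 60.7, 99.1, 150.2])        # hypothetical directions
a = np.radians(true_deg)
H = np.vstack([np.cos(a), np.sin(a)])                 # 2 x 4 mixing matrix
act = rng.random((4, 50000)) > 0.8                    # sparseness p_S = 0.8
s = rng.laplace(size=(4, 50000)) * act
x = H @ s

keep = np.linalg.norm(x, axis=0) > 1e-9               # drop all-zero samples
th = np.arctan2(x[1, keep], x[0, keep]) % np.pi       # angles mod pi
hist, edges = np.histogram(th, bins=360, range=(0.0, np.pi))

found = []                                            # greedy peak picking
for b in np.argsort(hist)[::-1]:
    c = (edges[b] + edges[b + 1]) / 2
    if all(min(abs(c - f), np.pi - abs(c - f)) > 0.1 for f in found):
        found.append(c)
    if len(found) == 4:
        break
print(np.round(np.degrees(np.sort(found)), 1))
```

Samples with exactly one active source lie exactly on a column direction, so the four strongest well-separated histogram bins mark the columns of H up to sign and order, as discussed above.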


For m = 2, a practical algorithm has been proposed by Bofill and Zibulevsky (2001). For any two-dimensional vector x = [x1, x2]^T ≠ 0 let us define the angle of x,

θ(x) = arctan(x2/x1).   (12.77)

The directions θ where the random variable θ_k = θ(x(k)) has the highest density are the directions of the mixing vectors. The density is estimated by means of a potential function U(θ):

U(θ) = Σ_k w(k) t(θ − θ_k; σ),   (12.78)

where t(α; σ) = 1 − |α|/(π/(4σ)) for |α| ≤ π/(4σ), and t(α; σ) = 0 otherwise. Extensions to m ≥ 3 observations are possible but impractical due to the high computational cost and the extremely large required data sets.

In an analogous way to the sparse case, the method is based on the properties of the density (pdf) ρ_Θ̄ of the random variable

θ̄ = θ(x) mod π,

where θ(x) is the angle of x defined in equation 12.77. Here, however, the peaks of the density may not have a one-to-one correspondence with the mixing vectors, especially when the number of sources is greater than the number of observations (n > m). The basic result is that the angles

θ_i = θ(h_i),   i = 1, ···, n,   (12.82)

of the mixing vectors h_i satisfy the geometric convergence condition (GCC) defined below:

Definition 12.2 Geometric Convergence Condition
The set of angles {θ1, ..., θn}, θi ∈ [0, π), satisfies the GCC if, for each i, θi is the median of ρ_Θ̄ restricted to the receptive field Φ(θi).

Definition 12.3 Receptive Field
For a set of angles {θ1, ..., θn}, θi ∈ [0, π), the receptive field Φ(θi) is the set consisting of the angles θ closest to θi: Φ(θi) = {θ ∈ [0, π) : |θ − θi| ≤ |θ − θj| for all j ≠ i}.

Since the angles of the true mixing vectors satisfy the GCC, we hope to find them by devising an algorithm which converges when the GCC is satisfied. This is exactly the aim of the geometric ICA algorithm (Theis et al., 2003a,b). This iterative algorithm works with a set of n unit-length vectors (and their opposites) and terminates only when the angles of these vectors are the medians of their corresponding receptive fields. It is conjectured that the only stable points of this algorithm are the true mixing vectors. The algorithm starts by picking n random pairs of opposing vectors {w_i(0), w_i′(0) = −w_i(0)}, i = 1, ···, n. At each iteration k, a new observation vector x(k) is projected onto the unit circle:

z(k) = x(k)/||x(k)||.

Then we locate the vector w_j(k) closest to z(k) and we update the pair w_j(k), w_j′(k) as follows:

w_j^{temp}  = w_j(k) + η(k) (z(k) − w_j(k)) / ||z(k) − w_j(k)||,
w_j(k+1)    = w_j^{temp} / ||w_j^{temp}||,
w_j′(k+1)   = −w_j(k+1).   (12.83)

The other w's are not updated in this iteration. It can be shown that the set W = {w1(∞), ..., wn(∞)} is a fixed point of this algorithm if and only if the angles θ(w1(∞)), ..., θ(wn(∞)) satisfy the GCC. We already know that the set A = {θ(h1), ..., θ(hn)} satisfies the GCC; therefore, we hope that, at convergence, {θ(w1(∞)), ..., θ(wn(∞))} = A. If this is true then the vectors w1(∞), ..., wn(∞) are parallel to the mixing vectors h1, ..., hn, although not necessarily in that order. Since the order and scale are insignificant, this is not a problem. If m = n, then the inverse of the estimated matrix, Ĥ^{−1} = [w1(∞), ..., wn(∞)]^{−1}, solves the BSS problem. In the underdetermined case (n > m), the general algorithm for source recovery is the maximization of P(s) under the constraint x = Hs. This linear optimization problem can be approached by various methods, such as the one described in section 12.3.2.

The FastGEO Algorithm
An alternative way to find the mixing vectors is to design a function which is zero exactly when its arguments satisfy the GCC; then we simply have to compute the zeros of this function, for example, by exhaustive search. This is the so-called FastGEO algorithm (Jung et al., 2001; Theis et al., 2003a). Let us separate the interval [0, π) into n subintervals with separating boundaries φ1, ..., φn, and let θi be the median of θ̄ in the subinterval [φi, φi+1]:

θ_i = F_Θ̄^{−1}( [F_Θ̄(φ_i) + F_Θ̄(φ_{i+1})]/2 ),   i = 1, ···, n,   (12.84)

where F_Θ̄ is the cumulative distribution function of θ̄, F_Θ̄^{−1} is the inverse function of F_Θ̄ (we assume it exists), and φ_{n+1} = φ_1 + π (see fig. 12.14). Then the function

μ^(n)(φ_1, ···, φ_{n−1}) = [ (θ_1 + θ_2)/2 − φ_2, ···, (θ_{n−1} + θ_n)/2 − φ_n ]^T   (12.85)

is zero if and only if

(θ_i + θ_{i+1})/2 = φ_{i+1},   i = 1, ···, n − 1,

for all i, and so by definition the receptive field Φ(θ_i) is exactly the subinterval [φ_i, φ_{i+1}] and θ_i is the median of its receptive field; in other words, the set {θ_1, ..., θ_n} satisfies the GCC. For each set of separating boundaries {φ_1, ..., φ_{n−1}} we compute the medians θ_1, ..., θ_n by equation 12.84 and then the function μ^(n)(φ_1, ···, φ_{n−1}) by equation 12.85. The FastGEO algorithm is the exhaustive search for the zeros of μ^(n).
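The iterative update of eq. 12.83 can be sketched as competitive learning on the unit circle. The sketch below uses the mixing matrix of example 12.7 and a small fixed learning rate (a decaying η(k) is the usual choice); per the conjecture above, the columns of W should drift toward the mixing directions up to order and sign, though a fixed rate only approximates the fixed point:

```python
import numpy as np

# Geometric ICA update (eq. 12.83) for n = m = 2.
rng = np.random.default_rng(6)
H = np.array([[0.0735, 0.2913],
              [-0.3391, 0.3725]])          # example 12.7's mixing matrix
s = rng.uniform(-1, 1, size=(2, 100000))
x = H @ s

W = rng.standard_normal((2, 2))
W /= np.linalg.norm(W, axis=0)             # n unit vectors w_1, w_2
eta = 0.02
for k in range(x.shape[1]):
    nx = np.linalg.norm(x[:, k])
    if nx < 1e-12:
        continue
    z = x[:, k] / nx                       # project onto the unit circle
    sims = W.T @ z
    j = np.argmax(np.abs(sims))            # winner among {+/- w_1, +/- w_2}
    zs = np.sign(sims[j]) * z              # match the sign of the winner
    d = zs - W[:, j]
    nd = np.linalg.norm(d)
    if nd > 1e-12:
        W[:, j] += eta * d / nd            # move toward z along the unit step
        W[:, j] /= np.linalg.norm(W[:, j])  # renormalize
```

Tracking both w_j and its opposite −w_j is folded into the sign matching, which is equivalent to updating the pair of opposing vectors of eq. 12.83.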


Figure 12.14 The angles θi of the mixing vectors hi satisfy the geometric convergence condition if each is the median of the random variable θ(x) within the interval Φ(θi) = [φi, φi+1]. Φ(θi) is called the receptive field of θi; it is the set of the angles θ closest to θi.

Especially for n = 2, we let φ1 = φ and φ2 = φ + π/2, so

μ^(2)(φ) = (θ1 + θ2)/2 − (φ + π/2).

Example 12.7
Let x1, x2 be two instantaneous mixtures of two uniform sources s1, s2. The mixtures were generated by the mixing operator

H = [ 0.0735    0.2913 ]
    [ −0.3391   0.3725 ].

The distribution of the angle θ(x) is shown in fig. 12.15. The same figure shows the receptive field boundaries {φ1, φ2, φ3} = {77.0998, 167.0998, 257.0998} (in degrees), corresponding to the angles {θ1, θ2} = {51.9759, 102.2237} of the mixing vectors h1 = [0.0735, −0.3391]^T and h2 = [0.2913, 0.3725]^T. The angles θ2 and θ1 + 180 are the medians of the angle distribution in the corresponding receptive fields.
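The angles and the boundary quoted in example 12.7 follow directly from the mixing matrix; the short check below computes the column angles mod 180 degrees and the midpoint boundary between them:

```python
import numpy as np

# Angles (mod 180 degrees) of example 12.7's mixing vectors, and the
# receptive-field boundary midway between them.
H = np.array([[0.0735, 0.2913],
              [-0.3391, 0.3725]])
theta = np.degrees(np.arctan2(H[1], H[0])) % 180.0
theta.sort()
print(np.round(theta, 2), np.round(theta.mean(), 2))
```

The two angles come out near 51.98 and 102.22 degrees, and their midpoint near the boundary φ1 = 77.0998 reported in the example.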

12.4 Conclusions

Blind signal processing (BSP) refers to a wide variety of problems where the output of a system is observable but neither the system nor the input is known. The large family of BSP problems includes blind signal separation (BSS), blind system or

Conclusions

377

0.012

φ1

φ2

φ3

0.01

Probability density

12.4

0.008

0.006

0.004

0.002

θ 0

1

0

50

θ

θ +180

2

100

θ +180

1

150

200

2

250

300

350

400

Angle (degrees)

The distribution of the angle θ(x) for two instantaneous mixtures of two uniformly distributed sources. The receptive ﬁeld boundaries are deﬁned by the angles φi . The angles θi of the mixing vectors are the medians of each receptive ﬁeld.

Figure 12.15

channel identification (BSI or BCI), and blind deconvolution (BD). Traditional approaches exploit statistical properties of second or higher order. Recently, a third approach has emerged that uses the geometric properties of the data cloud. This approach exploits the finite alphabet property of the input data, or the shape of the constellation, depending on the probability density of the sources. In such an approach our basic tools are methods for data clustering and shape description, such as the convex hull.

The advantage of the geometric approach is the finite nature of the methodology that follows the clustering step. Typically, this methodology is fast for small problem sizes, i.e., for few sources or short channels. The main disadvantage is the combinatorial explosion incurred when the problem size grows large. To combat this drawback, channel-shortening methods may come to our assistance. The problem, however, is far from solved and many issues remain open.

In this chapter we presented the main geometric principles used in blind signal processing. We presented a comprehensive literature survey of geometric methods and we outlined the basic methods for blind source separation, blind deconvolution, and blind channel identification.

13  Game-Theoretic Learning

Geoffrey J. Gordon

Whatever games are played with us, we must play no games with ourselves.
—Ralph Waldo Emerson

A game is a description of an environment where multiple decision makers can interact with one another. Each of the decision makers, called a player, may have its own goals; these goals may align with the goals of other players, conflict with them, or some combination of the two. Some traditional examples of games are bridge, blackjack, chess, roulette, and poker. Less traditional examples include auctions, marketing campaigns, decisions about where to build a new factory, and various types of social interactions such as applying for a job. Finally, many popular games mix components of perception and physical skill with the problem of making good decisions; examples include football, freeze tag, paintball, and driving in traffic.

This chapter is about how to learn to play a game. We will discuss how a player can, by repeated interaction with its environment and with the other players, discover how to make decisions which achieve its goals as reliably as possible. Playing a real-life game such as football is far beyond the capability of any current artificial learning system, but we will at least begin to address the issues of exploration and generalization which arise in such a problem.

The difficulty of making good decisions in a game can range from trivial to nearly impossible. We can classify games according to several dimensions; each of these classifications affects the type of solution we can seek, the algorithms we can use, and the difficulty of finding a good plan of action.

one-step vs. sequential

In one-step games each player decides on its strategy all at once. In sequential games a player commits to its actions in several steps, and after each step it may find out something about the other players' choices. Of course, to learn about any game the players will usually need to play it several times; so, we can speak of a repeated game, either one-step or sequential.

(im)perfect information

In perfect information games, each player knows all the choices that the other players have already made.
In games with imperfect information, the players know only some of the past choices of other players. For example, if an auction house is selling several copies of the same item one after another via sealed bids, the bidders will know the sale prices of the previous items but will not know the details of the previous bids.

(in)complete information

In complete information games the players know all the details of the game they are playing: they know the structure of the game, the outcomes of all past external events that are relevant to their future payoffs, and what the payoffs are in any situation for themselves and for the other players. In incomplete information games some of the players are missing some of this information; for example, in bridge or poker the players don't see each other's cards.

In this chapter we will discuss all of these different types of games in turn. Each one presents different difficulties, so we will discuss various algorithms for learning to play them. Each of the algorithms provides different performance guarantees, so we will describe and compare the types of guarantees that are available.

We will start with one-step games. The standard representation of one-step games is the normal form, described in section 13.1. Given the normal form, classical game theory looks for distributions of play from which no player can alter its actions to improve its payoffs; such distributions are called equilibria, and we will discuss them in section 13.2. Equilibria are possible outcomes of learning, since learning players will not be satisfied as long as they think they can improve their reward. So, in section 13.3, we will review learning algorithms for one-step games and use the different types of equilibria to describe what happens when various learning algorithms play against one another.

From one-step games we will move to sequential games. In sequential games we can define additional types of equilibria, and we need to move to more complicated learning algorithms.
Sections 13.4 and 13.5 cover these new equilibria and learning algorithms. Finally, we will conclude in section 13.6 with some examples of how game-theoretic learning algorithms have been applied to solve real-world problems. These problems range from poker to robotic soccer.

13.1  Normal-Form Games

Any game can in principle be described with the following information:

Normal form representation

- A list of the players. We will assume that there are only finitely many of them.
- For each player, a list of the actions (also called plays or pure strategies) that it may choose. We will assume that there are finitely many pure strategies. Any probability distribution over pure strategies is called a mixed strategy.
- Given a strategy profile (that is, a pure strategy for every player), the utility or payoff which each player assigns to the resulting outcome. If there are external random events which affect utility, we only need to know the expected utility of each strategy profile.


This representation is called the normal form of the game. As with any general representation, the normal form may be nowhere near the most concise way to describe a game. Still, it does allow us to discuss many different games and algorithms in a general way.

We can represent a normal-form game with a table: there is one entry in the table for each strategy profile, and the entry is a vector which lists the utility of the resulting outcome for each player. For example, the following table represents the children's game rock-paper-scissors:

          R        P        S
    R    0, 0    −1, 1    1, −1
    P    1, −1    0, 0    −1, 1
    S    −1, 1    1, −1    0, 0

common knowledge

The first entry on the second row of this table says that, if the row player chooses P (for "paper") while the column player chooses R (for "rock"), then the row player gets a payoff of 1 while the column player gets a payoff of −1. The payoffs can in general be random variables, but we are only interested in their expected values, so we will not bother to write out any other properties of their distributions.

The payoff table is assumed to be common knowledge. That is, every player knows it, every player knows that every player knows it, and so forth. If the payoff table is not common knowledge (that is, if some players have information about it that others don't), the game is called a Bayesian game; section 13.4.1 covers Bayesian games in more detail.

The simplest type of game to reason about is a two-player constant-sum game. Constant sum means that, for each strategy profile (i.e., for each entry in the table), the sum of the payoffs to the two players is constant. For example, rock-paper-scissors is a constant-sum game, since each utility vector sums to zero. Constant-sum normal-form games are one of the few types of game with a universally accepted and easy to compute solution concept (the minimax equilibrium; see section 13.2.1).

If the payoffs do not sum to a constant, or if there are more than two players, the game is called general sum. By convention a game with three or more players is always called general sum, even if the payoffs do sum to a constant, since minimax equilibrium doesn't make sense for multiplayer games.

Environments in which all players have the same payoffs are called cooperative or team games. Team games may appear easy, but they can be difficult to solve because of imperfect or incomplete information: while a player may believe that a particular strategy profile is best, it may not be able to trust the other players to agree. So, it may have to choose an action which appears suboptimal in order to try to reach a different but safer outcome.
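As a quick illustration of the normal form in code, the sketch below (our own representation, not from the chapter) stores the rock-paper-scissors payoffs as matrices and evaluates the expected payoff of a pair of mixed strategies.

```python
import numpy as np

# Rock-paper-scissors in normal form. R_row[i, j] is the row player's
# payoff when the row player plays action i and the column player plays
# action j (actions ordered R, P, S); R_col holds the column player's.
R_row = np.array([[ 0, -1,  1],
                  [ 1,  0, -1],
                  [-1,  1,  0]])
R_col = -R_row                      # constant-sum (here zero-sum) game

def expected_payoff(x, y, payoff):
    """Expected payoff under mixed strategies x (row) and y (column)."""
    return x @ payoff @ y

x = np.ones(3) / 3                  # uniform mixed strategy for the row player
y = np.array([0.5, 0.25, 0.25])     # an arbitrary mixed strategy
print(expected_payoff(x, y, R_row))  # 0.0: uniform play neutralizes any y
```

The uniform strategy yields zero against every opponent strategy here because each column of the payoff matrix sums to zero.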

13.2  Equilibrium

An equilibrium of a game is a self-reinforcing distribution over strategy profiles. That is, if it is common knowledge that all players are acting according to a given equilibrium, no one player wants to change how it plays. There are many different types of equilibrium, which differ in how they formalize the above definition.

common knowledge of rationality

Classical game theory takes the view that the best way to analyze a game is to determine what its equilibria are. We can justify this view if we assume that the game is common knowledge among the players, and that the players have common knowledge of each other's rationality. However, equilibria don't tell us everything if the players have limited computation (often called bounded rationality) or if some players disagree about the rules of the game. Also, a single game may have many equilibria, and it may not be clear how the players should (or can) select one.

In this chapter we take a slightly different view: we are more concerned with how a player may adapt its actions based on information about what the other players are doing. So, we will write down learning algorithms and analyze what happens when the players use these algorithms in different types of games. Still, the various ideas of equilibrium are important: for example, under appropriate circumstances some of the learning algorithms we describe below will converge toward various types of equilibrium play.

types of equilibrium

We have already mentioned one type of equilibrium, the minimax or von Neumann equilibrium for constant-sum matrix games. For general-sum games, there are at least two important types of equilibrium: the Nash equilibrium and the correlated equilibrium. And we will see even more types of equilibrium when we discuss sequential decision making in section 13.4 below.

safety value

In addition to the Nash and correlated equilibria, it is sometimes helpful to know the safety value of each player in the game.
A player's safety value is the best payoff that it can guarantee itself no matter what the other players do. That is, even if the other players irrationally ignore their own payoffs, they cannot force the first player to accept less than its safety value. In any equilibrium, each player must have a payoff at least as high as its safety value: if it did not, it would switch to its safety strategy.
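Safety values are easy to compute for small games. The sketch below uses a hypothetical 2×2 payoff matrix of our own choosing (not from the text) to show that mixing can raise the guarantee above the best pure-strategy guarantee.

```python
import numpy as np

# Safety-value sketch for a hypothetical 2x2 game (payoffs are ours,
# chosen only for illustration). The row player's safety value is the
# payoff it can guarantee no matter what the column player does.
payoff_row = np.array([[3.0, 0.0],
                       [1.0, 2.0]])

# Pure-strategy guarantee: best worst-case row.
safety_pure = payoff_row.min(axis=1).max()          # -> 1.0

# Mixed-strategy guarantee: max over p of the worst-case expected payoff
# when the row player plays its first action with probability p.
p = np.linspace(0.0, 1.0, 100_001)
worst_case = np.minimum(3 * p + 1 * (1 - p),        # column plays action 1
                        0 * p + 2 * (1 - p))        # column plays action 2
safety_mixed = worst_case.max()                     # -> 1.5, at p = 1/4

print(safety_pure, round(float(safety_mixed), 4))
```

In general the mixed safety value is the maximin value, computable by the same linear program used for minimax equilibria in section 13.2.1.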

13.2.1  Minimax Equilibrium

In a minimax equilibrium the players are required to choose independent probability distributions over their strategies, say x for the row player and y for the column player. If the payoff to the row player is r(x, y), then the minimax value of the game for the row player is

    min_y max_x r(x, y)                          (13.1)

and a minimax strategy for the row player is any value of x which achieves the maximum in 13.1. Neither player will wish to deviate from its set of minimax strategies, since any deviation will give the other player a strategy which gets strictly better than the minimax payoff.

Any constant-sum matrix game has at least one minimax equilibrium. The set of these minimax equilibria is convex, all minimax equilibria have the same value, and min_y max_x r(x, y) = max_x min_y r(x, y).

In the game of rock-paper-scissors described above, there is exactly one minimax equilibrium: both players play R, P, and S with equal probability. By taking the expectation over the different possible outcomes (each of the nine profiles RR, RP, RS, . . . has probability 1/9) we can see that the average payoff is zero for both players. On the other hand, if one of the players deviates from the equilibrium, the other player has a response that nets a better payoff: for example, if the row player picks R with probability 1/2 and P and S with probability 1/4 each, then the column player can choose P all the time. The column player will then get an expected payoff of (1/2)·1 + (1/4)·0 + (1/4)·(−1) = 1/4 > 0.
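A minimax strategy of a zero-sum matrix game can be computed by linear programming: maximize v subject to x being a distribution whose expected payoff against every pure column strategy is at least v. The sketch below is our own formulation (using scipy, which is an assumption, not the chapter's tooling); it recovers the uniform equilibrium of rock-paper-scissors.

```python
import numpy as np
from scipy.optimize import linprog

# Minimax strategy via LP. Variables: [x_1..x_n, v], where x is the row
# player's mixed strategy and v the value it guarantees. linprog
# minimizes, so we negate the objective to maximize v.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])       # rock-paper-scissors, row payoffs
n = A.shape[0]

c = np.zeros(n + 1)
c[-1] = -1.0                          # maximize v  <=>  minimize -v
# For every column j: v - (x^T A)_j <= 0, i.e. x earns at least v.
A_ub = np.hstack([-A.T, np.ones((n, 1))])
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])   # sum(x) = 1
b_eq = np.array([1.0])
bounds = [(0, None)] * n + [(None, None)]               # x >= 0, v free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:n], res.x[-1]
print(np.round(x, 3), round(float(v), 6))   # x ~ [1/3, 1/3, 1/3], v ~ 0
```

For rock-paper-scissors the constraints force x_P ≥ x_S ≥ x_R ≥ x_P at the optimum, so the uniform strategy is the unique solution.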

13.2.2  Nash Equilibrium

In general-sum games the idea of minimax equilibrium no longer makes sense, so we must seek alternative types of equilibrium. Perhaps the best-known type of equilibrium for general-sum games is the Nash equilibrium. In a Nash equilibrium we require the players to choose independent distributions over strategies. So, a Nash equilibrium is a profile of strategy distributions such that, if we hold the distributions fixed for every player except one, the remaining player can get no benefit by changing its play.

There is always at least one Nash equilibrium for every game, and there may be many. In a constant-sum game, Nash equilibria are the same as minimax equilibria. But in a general-sum game, a player's payoff may differ greatly from one Nash equilibrium to another, and the set of Nash equilibria may be nonconvex and difficult to compute.

To illustrate Nash equilibria, consider the game of "Battle of the Sexes". In this game, a husband and wife want to decide whether to go to the opera (O) or the football game (F). One of them (the row player) prefers opera, while the other (the column player) prefers football. But, they also prefer to be together; so, they have the following payoffs:

          O       F
    O    4, 3    0, 0
    F    0, 0    3, 4

This game has three Nash equilibria. Two of them are deterministic: both players go to the opera, or both go to football. The last one is mixed: the row player picks opera 4/7 of the time, while the column player picks opera 3/7 of the time.


(In this mixed strategy, each player's distribution makes the other player perfectly indifferent about whether to pick opera or football.)
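The mixed equilibrium follows from those indifference conditions. The short check below is our own arithmetic, written with exact fractions: the row player mixes so the column player is indifferent, and vice versa.

```python
from fractions import Fraction

# Mixed Nash equilibrium of Battle of the Sexes from the indifference
# conditions (our own arithmetic). Payoffs: OO -> (4, 3), FF -> (3, 4),
# mismatches -> (0, 0).
p = Fraction(4, 7)   # row player's probability of opera
q = Fraction(3, 7)   # column player's probability of opera

# Column player indifferent given the row player's mix p:
#   payoff of O = 3p must equal payoff of F = 4(1 - p).
assert 3 * p == 4 * (1 - p)
# Row player indifferent given the column player's mix q:
#   payoff of O = 4q must equal payoff of F = 3(1 - q).
assert 4 * q == 3 * (1 - q)

print(p, q)   # 4/7 3/7
```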

13.2.3  Correlated Equilibrium

In real life many people would solve the Battle of the Sexes by flipping a coin to decide which event to go to. This strategy is not a Nash equilibrium: in a Nash equilibrium the players are not allowed to communicate before the game, so they cannot both see the same coin flip. Instead it is a correlated equilibrium, which is like a Nash equilibrium except that we drop the requirement of independence between the players' distributions over strategies.

More formally, consider a distribution P over the set of strategy profiles; P may contain arbitrary correlations between the strategies of the different players. Some external mechanism, which we will call the moderator, selects a strategy profile x according to P and reports to player i the action xi that it is supposed to follow. P is a correlated equilibrium if player i has no incentive to play anything other than xi (even after finding out that xi was recommended, which may tell it something about the other players' strategies).

The coin-flip strategy is a correlated equilibrium in which the distribution P places weight 1/2 on each of the profiles OO and FF. Another everyday example of a correlated equilibrium is a traffic light: we can model the light as being randomly red or green as we approach it.1 Red is the moderator's recommendation to stop, while green means to go through the intersection without stopping. Given a red light it is not worth going through the intersection and risking a crash, while with a green light we can assume that the traffic on the cross street will stop and our best strategy is to maintain speed.
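Whether a distribution over profiles is a correlated equilibrium can be checked mechanically. In the sketch below (our own derivation, not the chapter's code) each incentive inequality is multiplied through by the conditioning probability, which avoids dividing by a possibly zero denominator; the coin flip for Battle of the Sexes passes, while the uniform distribution does not.

```python
from fractions import Fraction

def is_correlated_eq(a, b, c, d):
    """Check the correlated-equilibrium constraints for Battle of the
    Sexes, where (a, b, c, d) are the probabilities of OO, OF, FO, FF."""
    # Row recommended O: following (4 per OO) must beat deviating to F
    # (3 per OF), weighted by the conditional distribution over columns.
    row_O = 4 * a >= 3 * b
    # Row recommended F: 3 per FF must beat 4 per FO.
    row_F = 3 * d >= 4 * c
    # Column recommended O: 3 per OO must beat 4 per FF after deviating.
    col_O = 3 * a >= 4 * c
    # Column recommended F: 4 per FF must beat 3 per OO after deviating.
    col_F = 4 * d >= 3 * b
    return (a + b + c + d == 1 and min(a, b, c, d) >= 0
            and row_O and row_F and col_O and col_F)

half, quarter = Fraction(1, 2), Fraction(1, 4)
print(is_correlated_eq(half, 0, 0, half))                      # True
print(is_correlated_eq(quarter, quarter, quarter, quarter))    # False
```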

13.2.4  Equilibria in Battle of the Sexes

To gain more intuition for Nash and correlated equilibria we will illustrate how to compute them for the Battle of the Sexes. We will start by computing the correlated equilibria, which satisfy a set of linear equality and inequality constraints; we will then obtain the Nash equilibria by adding in some nonlinear constraints.

We can describe a correlated equilibrium in Battle of the Sexes with numbers a, b, c, and d representing the probabilities of the four strategy profiles OO, OF, FO, and FF:

          O    F
    O     a    b
    F     c    d

Suppose that the row player receives the recommendation O. Then it knows that the column player will play O and F with probabilities a/(a+b) and b/(a+b). (The denominator is nonzero since the row player has received the recommendation O.) The definition


[Figure: Equilibria in the Battle of the Sexes. The corners of the outlined simplex correspond to the four pure strategy profiles OO, OF, FO, and FF; the curved surface is the set of distributions where the row and colu