{"title": "A Variational Mean-Field Theory for Sigmoidal Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 374, "page_last": 380, "abstract": null, "full_text": "A variational mean-field theory for \n\nsigmoidal belief networks \n\nc. Bhattacharyya \n\nComputer Science and Automation \n\nIndian Institute of Science \nBangalore, India, 560012 \ncbchiru@csa.iisc.ernet.in \n\nS. Sathiya Keerthi \n\nMechanical and Production Engineering \n\nNational University of Singapore \n\nmpessk@guppy.mpe.nus.edu.sg \n\nAbstract \n\nA variational derivation of Plefka's mean-field theory is presented. \nThis theory is then applied to sigmoidal belief networks with the \naid of further approximations. Empirical evaluation on small scale \nnetworks show that the proposed approximations are quite com(cid:173)\npetitive. \n\n1 \n\nIntroduction \n\nApplication of mean-field theory to solve the problem of inference in Belief Net(cid:173)\nworks(BNs) is well known [1]. In this paper we will discuss a variational mean-field \ntheory and its application to BNs, sigmoidal BNs in particular. \n\nWe present a variational derivation of the mean-field theory, proposed by Plefka[2]. \nThe theory will be developed for a stochastic system, consistin~ of N binary random \nvariables, Si E {O, I}, described by the energy function E(S), and the following \nBoltzmann Gibbs distribution at a temperature T: \n\n_ \n\nP(S) = -z- , z = ~ e-----;y-. \n\n\" \" ' \n\nE(S) \n\n~ \n\ne- T \n\nS \n\nThe application of this mean-field method to Boltzmann Machines(BMs) is already \ndone [3]. A large class of BN s are described by the following energy function: \nf(Mi)} Mi = L WijSj + hi \n\nE(S) = - L {Si In f(Mi) + (1 - Si) In(1 -\n\ni-l \n\nN \n\ni=l \n\nj=l \n\nThe application of the mean-field theory for such energy functions is not straight(cid:173)\nforward and further approximations are needed. We propose a new approximation \nscheme and discuss its utility for sigmoid networks, which is obtained by substitut-\ning \n\n1 \n\nf(x) = 1 + eX \n\n\fin the above energy function. The paper is organized as follows. In section 2 we \npresent a variational derivation of Plefka's mean-field theory. In section 3 the theory \nis extended to sigmoidal belief networks. In section 4 empirical evaluation is done. \nConcluding remarks are given in section 5. \n\n2 A Variational mean-field theory \n\nPlefka,[2] proposed a mean-field theory in the context of spin glasses. This theory \ncan, in principle, yield arbitrarily close approximation to log Z. In this section we \npresent an alternate derivation from a variational viewpoint, see also [4],[5]. \n\nLet 'Y be a real parameter that takes values from 0 to 1. Let us define a 'Y dependent \npartition and distribution function, \n\n(1) \n\nNote that Zl = Z and Pl = p. Introducing an external real vector, Blet us rewrite \n(1) as \n\nZ \n\n,\",e--Y~+2:.(JiS' -\"'(JosoZ-\n'Y = L.... \ns \n\ne LJi ' \n\nZ \n\n, \n\n(2) \n\nwhere Z is the partition function associated with the distribution function p-y given \nby \n\ne --y~+ 2:i (JiSi \n- - - -= - -\n\n.-\nZ \n\n(3) \n\n(4) \n\n(5) \n\n(6) \n\n_ \nZ -\n\nE \n\n' \" \n\n'\"' e--Y\"T+ L.. \u00b0 (JiSi \n-L.... \ns \n\n' , , -\n\nP-\n\nUsing Jensen's Inequality, (e- X ) ~ e-(x), we get \n\nZ-y = Z L p-ye - 2:. (JiSi ~ Z e - 2:. (JiUi \n\nwhere \n\nS \n\nUi = (Si)-\nP-r \n\nTaking logarithms on both sides of (4) we obtain \n\nlog Z-y ~ log Z - L OiUi \n\nThe right hand side is defined as a function of u and 'Y via the following assumption. \nInvertibility assumption: For each fixed u and 'Y, (5) can be solved for if \nIf the invertibility assumption holds then we can use u as the independent vector \n(with B dependent on u) and rewrite (6) as \n\n(7) \n\nwhere G is as defined in \n\nG(u,'Y) = -lnZ + LOiUi. \n\ni \n\n\fThis then gives a variational feel: treat it as an external variable vector and choose \nit to minimize G for a fixed 'Y. The stationarity conditions of the above minimization \nproblem yield \n\n{)G \n\n(Ji = - = O. \n\n()Ui \n\nAt the minimum point we have the equality G = - log Z\"(. \nIt is difficult to invert (5) for'Y :I 0, thus making it impossible to write an algebraic \nexpression for G for any nonzero 'Y. At 'Y = 0 the inversion is straightforward and \none obtains \n\nN \n\nG(it,O) = 2)Ui In Ui + (1 - Ui) In(l- Ui)) , Po = II ui(1 - Ui). \n\n~1 \n\ni \n\nA Taylor series approach is then undertaken around 'Y = 0 to build an approximation \nto G. Define \n\n-\n\nGM = G(u,O) + L kT 8k \n'Y \n\n'Yk ()kG I \n\n_ \n\nk \n\n,,(=0 \n\n(8) \n\nThen G M can be considered as an approximation of G. The stationarity conditions \nare enforced by setting \n\n(Ji = {)G ~ {)GM = O. \n\n{)Ui \n\n{)Ui \n\nIn this paper we will restrict ourselves to M = 2. To do this we need to evaluate \nthe following derivatives \n\n(9) \n\n(10) \n\nwhere \n\nFor M = 1 we have the standard mean-field approach. The expression for M = 2 \ncan be identified with the TAP correction. The term (10) yields the TAP term for \nBM energy function. \n\n3 Mean-field approximations for BNs \n\nThe method, as developed in the previous section, is not directly useful for BNs \nbecause of the intractability of the partial derivatives at 'Y = O. To overcome this \nproblem, we suggest an approximation based on Taylor series expansion. Though \nin this paper we will be restricting ourselves to sigmoid activation function, this \nmethod is applicable to other activation functions also. This method enables cal(cid:173)\nculation of all the necessary terms required for extending Plefka's method for BN s. \nSince, for BN operation T is fixed to 1, T will be dropped from all equations in the \nrest of the paper. \n\n\fLet us define a new energy function \n\nN \n\nE((3,S,il,w) = - 2)Silnf(Mi((3)) + (1- Si)ln(I- f(Mi((3))} \n\n(11) \n\nwhere 0 ~ (3 ~ 1, \n\ni=l \n\ni-l \n\ni-l \n\nMi((3) = L Wij(3(Sj - Uj) + Mi , Mi = L WijUj + hi \n\nj=l \n\nj=l \n\nwhere \n\nUk = L SkP\"(/3 Vk, P\"(/3 = \n\n\u00a7 \n\n- 'VE+\" (J\u00b7S\u00b7 \ne \n\nI Di t \n\n1. \n\n~ 2: .. \n. (J.S. \n\n\" \n6 ; e \u00b7 \n\n- ,,(E+ \n\n(12) \n\nSince (3 is the important parameter, E((3, S, il, w) will be referred to as E((3) so as \nto avoid notational clumsiness. We use a Taylor series approximation of E((3) with \nrespect to (3. Let us define \n\n~ \nEc((3) = E(O) + (; kf o(3k /3=0 \n\n(3k okE I \n\n~ \n\ne \n\nIf Ee approximates E, then we can write \n\nE = E(I) ~ Ec(I). \n\nLet us now define the following function \n\nA(r,(3,il) = _lnL e - \"(E+2:i(JiSi + LBiUi \n\n; \n\n(13) \n\n(14) \n\n(15) \n\nThe Bi are assumed to be functions of il, (3, 'Y, which are obtained by inverting \nequations (12) By replacing E by Ee in (15) we obtain Ae \n\nAe(r, (3, il) = -In L e - \"(Ec+ 2:, (JiS, + L BiUi \n\n(16) \n\n; \n\nwhere the definition of il is obtained by replacing E by Ee. In view of (14) one can \nconsider Ae as an approximation to A. This observation suggests an approximation \nto G. \n\n(17) \nThe required terms needed in the Taylor expansion of G in 'Y can be approximated \nby \n\nG(r, il) = A(r, 1, il) ~ Ae(r, 1, il) \n\nG(O, il) = A(O, 1, il) = Ac(O, 1, il) \nOk Ae I \n\nokG I \no'Yk ,,(=0 = o'Yk ,,(=0,/3=1 ~ o'Yk \n\nOk A I \n\n,,(=0,/3=1 \n\nThe biggest advantage in working with Ae rather than G is that the partial deriva(cid:173)\ntives of Ae with respect to 'Y at 'Y = 0 and (3 = 1 can be expressed as functions of \nil. We define \n\n(18) \n\n\fFigure 1: Three layer BN (2 x 4 x 6) with top down propagation of beliefs. The \nactivation function was chosen to be sigmoid. \n\nIn light of the above discussion one can consider G M \nequations can be stated as \n\n:::::i a MC j hence the mean-field \n\n(}i = aG :::::i aGM :::::i aaMc = 0 \n\naUi \n\naUi \n\naUi \n\n(19) \n\nIn this paper we will restrict ourselves to M = 2. The relevant objective functions \nfor a general C is given by \n\nAll these objective functions can be expressed as a function of u. \n\n4 Experimental results \n\nTo test the approximation schemes developed in the previous schemes, numerical \nexperiments were conducted. Saul et al.[l] pioneered the application of mean-field \ntheory to BNs. We will refer to their method as the SJJ approach. We compare \nour schemes with the SJ J approach. \n\nSmall Networks were chosen so that In Z can be computed by exact enumeration \nfor evaluation purposes. For all the experiments the network topology was fixed \nto the one shown in figure 1. This choice of the network enables us to compare \nthe results with those of [1]. To compare the performance of our methods with \ntheir method we repeated the experiment conducted by them for sigmoid BNs. Ten \nthousand networks were generated by randomly choosing weight values in [-1,1]. \nThe bottom layer units, or the visible units of each network were instantiated to \nzero. The likelihood, In Z, was computed by exact enumeration of all the states in \nthe higher two layers. The approximate value of - In Z was computed by a MC j \nU was computed by solving the fixed point equations obtained from (19). The \ngoodness of approximation scheme was tested by the following measure \n\nc = - aMc -1 \n\nInZ \n\n(22) \n\nFor a proper comparison we also implemented the SJJ method. The goodness \nof approximation for the SJ J scheme is evaluated by substituting a MC, in (22) \nby Lsapprox, for specific formula see [1]. The results are presented in the form \nof histograms in Figure 2. We also repeated the experiment with weights and \n\n\f(\u00a3) \n\n(\u00a3) \n\nsmall weights [-1, 1] \n\nlarge weights [-5,5] \n\nGu \nG12 \nG22 \nSJJ \n\n-0.0404 \n0.0155 \n0.0029 \n0.0157 \n\n-0.0440 \n0.0231 \n-0.0456 \n0.0962 \n\nTable 1: Mean of \u00a3 for randomly generated sigmoid networks, in different weight \nranges. \n\nbiases taking values between -5 and 5, the results are again presented in the form of \nhistograms in Figure 3. The findings are summarized in the form of means tabulated \nin Table l. \nFor small weights G12 and the SJJ approach show close results, which was expected. \nBut the improvement achieved by the G22 scheme is remarkable; it gave a mean \nvalue of 0.0029 which compares substantially well against the mean value of 0.01139 \nreported in [6]. The improvement in [6] was achieved by using mixture distribution \nwhich requires introduction of extra variational variables; more than 100 extra vari(cid:173)\national variables are needed for a 5 component mixture. This results in substantial \nincrease in the computation costs. On the other hand the extra computational \ncost for G22 over G12 is marginal. This makes the G22 scheme computationally \nattractive over the mixture distribution. \n\nFigure 2: Histograms for GlO and SJJ scheme for weights taking values in [-1,1], \nfor sigmoid networks. The plot on the left show histograms for \u00a3 for the schemes \nGu and G12 They did not have any overlaps; Gu , gives a mean of -0.040 while G12 \ngives a mean of 0.0155. The middle plot shows the histogram for the SJJ scheme, \nmean is given by 0.0157.The plot at the extreme right is for the scheme G22 , having \na mean of 0.0029 \n\n\" \n\n,\\ 0 '\" \n\nOf the three schemes G12 is the most robust and also yields reasonably accurate \nresults. It is outperformed only by G22 in the case of sigmoid networks with low \nweights. Empirical evidence thus suggests that the choice of a scheme is not straight(cid:173)\nforward and depends on the activation function and also parameter values. \n\n\fFigure 3: Histograms for the G10 and SJJ schemes for weights taking values in \n[-5,5] for sigmoid networks. The leftmost histogram shows \u00a3 for G11 scheme having \na mean of -0.0440, second from left is for G12 scheme having a mean of 0.0231, and \nsecond from right is for SJJ scheme, having a mean of 0.0962. The scheme G22 is \nat the extreme right with mean -0.0456. \n\n5 Discussion \n\nApplication of Plefka's theory to BNs is not straightforward. It requires compu(cid:173)\ntation of some averages which are not tractable. We presented a scheme in which \nthe BN energy function is approximated by a Taylor series, which gives a tractable \napproximation to the terms required for Plefka's method. Various approximation \nschemes depending on the degree of the Taylor series expansion are derived. Unlike \nthe approach in [1], the schemes discussed here are simpler as they do not introduce \nextra variational variables. Empirical evaluation on small scale networks shows that \nthe quality of approximations is quite good. For a more detailed discussion of these \npoints see [7]. \n\nReferences \n\n[1] Saul, L. K. and Jaakkola, T. and Jordan, M. 1.(1996), Mean field theory for sigmoid \n\nbelief networks, Journal of Artificial Intelligence Research,4 \n\n[2] Plefka, T . (1982), Convergence condition of the TAP equation for the Infinite-ranged \n\nIsing glass model,J. Phys. A: Math. Gen.,15 \n\n[3] Kappen, H. J and Rodriguez, F. B(1998), Boltzmann machine learning using mean \n\nfield theory and linear response correction, Advances in Neural Information Process(cid:173)\ning Systems 10, (eds.) M. I. Jordan and M. J. Kearns and S. A. Solla, MIT press \n\n[4] Georges, A. and Yedidia, J. S.(1991), How to expand around mean-field theory using \n\nhigh temperature expansions,J. Phys. A: Math. Gen., 24 \n\n[5] Bhattacharyya, C. and Keerthi, S. S.(2000), Information geometry and Plefka's mean(cid:173)\n\nfield theory, J. Phys. A: Math. Gen.,33 \n\n[6] Bishop, M. C. and Lawrence, N. and Jaakkola, T. and Jordan, M. 1.(1997), Approxi(cid:173)\n\nmating Posterior Distributions in Belief Networks using Mixtures, Advances in Neural \nInformation Processing Systems 10, (eds.) Jordan, M. I. and Kearns, M. J. and Solla, \nS., MIT press \n\n[7] Bhattacharyya, C. and Keerthi, S. S. (1999), Mean field theory for a special class of \n\nbelief networks, accepted in Journal of Artificial Intelligence Research \n\n\f", "award": [], "sourceid": 1918, "authors": [{"given_name": "Chiranjib", "family_name": "Bhattacharyya", "institution": null}, {"given_name": "S.", "family_name": "Keerthi", "institution": null}]}