Monday, 25 May 2015

Accounting for Data Mining Bias

I've recently subscribed to this forexfactory thread, which is about using machine learning to develop trading systems, and the subject of data mining/data dredging has come up. This post is a short description of how such mining/dredging can be accounted for. Readers should be aware that what follows is not a precise description of any particular test with accompanying code, but rather a hand-wavy explanation of a general but important principle.

Suppose one has conducted a whole series of tests on a particular set of data with a view to developing a trading system. The precise nature of this is not really important - it could be some machine learning approach, a grid search of moving average crossover parameter values, a series of elimination contests to find the "best" indicator, or whatever. While doing this we keep a record of all our results and when the search is complete we plot a histogram thus:-
which is the result of 160,000 distinct tests plotted in 200 bins. Naturally, having done this, we select the best system found, represented by the vertical cursor line at the x-axis value of 5.2. This 5.2 is our test metric of choice, be it Sharpe ratio, win to loss ratio or whatever. But then we must ask ourselves: have we truly found a world-beating system, or is this discovery merely the result of data mining?
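By way of illustration only, the record keeping might look something like the following minimal Octave sketch, where run_one_test and real_data are hypothetical stand-ins for whichever test and data set are actually being used:

num_tests = 160000 ;
results = zeros( num_tests , 1 ) ;
for ii = 1 : num_tests
  results( ii ) = run_one_test( real_data , ii ) ; % hypothetical: returns the chosen test metric for the ii-th candidate system
end
hist( results , 200 ) ;                      % the green histogram plotted in 200 bins
[ best_metric , best_ix ] = max( results ) ; % the selected best system, e.g. the 5.2 above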

To test this, we create a random set of data which has the same attributes as the real data used above. The random data can be obtained by bootstrapping, random permutation, application of a Markov chain with state spaces derived from the original data, etc. The actual choice of which to use will depend on the null hypothesis one wants to test. Having obtained our random data set, we then perform the exact same search as we did above and record the test metric of the best performing system found on this random data set. We repeat this 160,000 times and then plot a histogram ( in red ) of the best test results over all these random data sets:-
We find that this red distribution of best results has a mean value of 0.5 and a standard deviation of 0.2. What it represents is the ability/power of our machine learning algo, grid search criteria etc. to uncover "good" systems even in meaningless data, where all relationships are, in effect, spurious and contain no predictive ability.
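As a concrete, but still hand-wavy, Octave sketch of this step, using random permutation and a hypothetical run_search function that performs the exact same search as was run on the real data and returns the best metric it finds:

num_null_runs = 160000 ;
null_best = zeros( num_null_runs , 1 ) ;
for ii = 1 : num_null_runs
  % a random permutation destroys any real temporal structure; a bootstrap or
  % Markov chain could be substituted depending on the chosen null hypothesis
  random_data = real_data( randperm( rows( real_data ) ) , : ) ;
  null_best( ii ) = run_search( random_data ) ; % the best metric found on meaningless data
end
hist( null_best , 200 ) ;  % the red histogram
mu = mean( null_best ) ;   % e.g. the 0.5 above
sigma = std( null_best ) ; % e.g. the 0.2 above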

We must now suppose that this power to uncover spurious relationships also exists in our original set of tests on the real data, and it must be accounted for. For purposes of illustration I'm going to take a naive approach: add 4 times the standard deviation of the red distribution to its mean, and shift our original green distribution to the right by an amount equal to this sum, a value of 1.3 thus:-
We now see that our original test metric value of 5.2, which was well out in the tail of the non-shifted green distribution, is comfortably within the tail of the shifted distribution, and depending on our choice of p-value etc. we may not be able to reject our null hypothesis, whatever it may have been.
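Continuing the Octave sketches above ( results, best_metric, null_best, mu and sigma as before ), the naive adjustment, plus a more conventional empirical p-value drawn from the same red distribution, might look like:

discount = mu + 4 * sigma ;            % 0.5 + 4 * 0.2 = 1.3
shifted_results = results + discount ; % the green distribution shifted to the right
hist( shifted_results , 200 ) ;        % where does the 5.2 cursor now sit?
% alternatively, the proportion of null searches that matched or beat the real
% best metric is an empirical p-value for the best system being a mining artifact
p_value = mean( null_best >= best_metric ) ;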

As I warned readers above, this is not supposed to be a mathematically rigorous exposition of how to account for data mining bias, but rather an illustrative explanation of the principle(s) behind accounting for it. The main takeaway is that the red distribution, whatever it is for the test(s) you are running, needs to be generated, and then the tests on real data need to be appropriately discounted by the relevant measures taken from the red distribution before any inferences are drawn about the efficacy of the results on the real data.

For more information about data mining tests readers might care to visit a Github repository I have created, which contains code and some academic papers on the subject. 

Thursday, 23 April 2015

A Simple Visual Test of CRBM Performance

Following on from the successful C++ .oct coding of the Gaussian and binary units, I thought I would conduct a simple visual test of the conditional restricted Boltzmann machine, both as a test of the algorithm itself and of my coding of the .oct functions. For this I selected a "difficult" set of price bars from the EURUSD forex pair, difficult in the sense that it is a ranging market with opportunities for the algorithm to fail at market turns, and this is shown below:
For this test there was no optimisation whatsoever; I simply chose a model order of 3, 50 hidden units for both the Gaussian and binary CRBMs, the top 500 training examples from my Cauchy-Schwarz matching algorithm, 100 training epochs and 3 sets of features to model the next day's open, the 3 day maximum high and the 3 day minimum low respectively. These training targets are different from the targets I presented earlier, where I modelled the next 3 OHLC bars individually, because of the results of a very simple analysis of what I will be trying to predict with the CRBM.
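For clarity, these three training targets can be sketched in Octave as follows, where open, high and low are hypothetical column vectors of daily prices and ix indexes the most recent bar available to the model:

target_open = open( ix + 1 ) ;                 % the next day's open
target_high = max( high( ix + 1 : ix + 3 ) ) ; % the 3 day maximum high
target_low  = min( low( ix + 1 : ix + 3 ) ) ;  % the 3 day minimum low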

The video below presents the results of this visual test. It shows a sliding window of 9 bars moved from left to right over the bars shown in the chart above. In this window the first 6 bars are used to initialise the CRBM, with the 6th bar being the most recent, and the 7th, 8th and 9th bars are the "future" bars for which the CRBM models the open price of the 7th bar and the 3 bar max high and min low of the 7th, 8th and 9th bars. The open price level is shown by the horizontal black line, the max high by the blue line and the min low by the red line.
In viewing this, readers should bear in mind that these levels will be the inputs to my MFEMAE indicator, and so the accuracy of the absolute levels is not as important as the ratios between them. However, that said, I am quite impressed with this unoptimised performance and I am certain that it can be improved with cross-validated optimisation of the various parameters. This will be my focus for the near future.

Monday, 20 April 2015

Optimised CRBM Code for Binary Units

Following on from my previous post, in the code box below there is the C++ .oct code for training the binary units of the CRBM and, as with the code for training the Gaussian units, the code is liberally commented. There is only a slight difference between the two code files.
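That slight difference lies in the negative phase where the visible units are activated: the binary version pushes the visible activations through a logistic function, whereas the Gaussian version takes the mean of the Gaussian directly. In the Octave notation used in the comments of the two listings, this amounts to:

% binary crbm: visibles are reconstructed through a logistic function
negdata = 1 ./ ( 1 + exp( -( hid_states * w + repmat( bi' , num_cases , 1 ) + bi_star' ) ) ) ;

% gaussian crbm: negdata is just the mean of the Gaussian ( gsd assumed to be 1 )
negdata = hid_states * w + repmat( bi' , num_cases , 1 ) + bi_star' ;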
DEFUN_DLD ( cc_binary_crbm_mersenne , args , nargout ,
"-*- texinfo -*-\n\
@deftypefn {Function File} {} cc_binary_crbm_mersenne (@var{ batchdata , minibatch , nt , num_epochs , num_hid })\n\n\
This function trains a binary crbm. The value nt is the order of the model, i.e. how many previous values should be included,\n\
num_epochs is the number of training epochs, and num_hid is the number of nodes in the hidden layer.\n\
@end deftypefn" )

{
octave_value_list retval_list ;
int nargin = args.length () ;

// check the input arguments
if ( nargin != 5 )
   {
   error ( "Input error: there are insufficient input arguments. Type help for more details." ) ;
   return retval_list ;
   }

if ( !args(0).is_real_matrix () ) // check batchdata
   {
   error ( "Input error: the 1st argument, batchdata, is not a matrix type. Type help for more details." ) ;
   return retval_list ;
   }

if ( !args(1).is_real_matrix () ) // check minibatch
   {
   error ( "Input error: the 2nd argument, minibatch, is not a matrix type. Type help for more details." ) ;
   return retval_list ;
   }   
   
if ( args(2).length () != 1 ) // check nt
   {
   error ( "Input error: nt should be an interger value for the 'order' of the model. Type help for more details." ) ;
   return retval_list ;
   }
   
if ( args(3).length () != 1 ) // num_epochs
   {
   error ( "Input error: num_epochs should be an integer value for the number of training epochs. Type help for more details." ) ;
   return retval_list ;
   }
   
if ( args(4).length () != 1 ) // check num_hid
   {
   error ( "Input error: num_hid should be an integer for the number of nodes in hidden layer. Type help for more details." ) ;
   return retval_list ;
   }
     
if ( error_state )
   {
   error ( "Input error: type help for details." ) ;
   return retval_list ;
   }
// end of input checking
  
// inputs
Matrix batchdata = args(0).matrix_value () ;
Matrix minibatch = args(1).matrix_value () ;
int nt = args(2).int_value () ; // the "order" of the model
int num_epochs = args(3).int_value () ;
int num_hid = args(4).int_value () ;

// variables
// batchdata is a big matrix of all the data and we index it with "minibatch", a matrix of mini-batch indices in the columns                       
int num_cases = minibatch.rows () ;                                                     // Octave code ---> num_cases = length( minibatch{ batch } ) ;
int num_dims = batchdata.cols () ;                                                      // visible dimension
Matrix bi_star ( num_dims , num_cases ) ; bi_star.fill( 0.0 ) ;                         // Octave code ---> bi_star = zeros( num_dims , num_cases ) ; 
Matrix bj_star ( num_hid , num_cases ) ; bj_star.fill( 0.0 ) ;                          // Octave code ---> bj_star = zeros( num_hid , num_cases ) ; 
Matrix repmat_bj ( num_hid , num_cases ) ; repmat_bj.fill( 0.0 ) ;                      // for Octave code ---> repmat( bj , 1 , num_cases )
Matrix repmat_bi ( num_dims , num_cases ) ; repmat_bi.fill( 0.0 ) ;                     // for Octave code ---> repmat( bi , 1 , num_cases )
Matrix eta ( num_hid , num_cases ) ; eta.fill( 0.0 ) ;
Matrix h_posteriors ( num_hid , num_cases ) ; h_posteriors.fill( 0.0 ) ;                // for the logistic function
Matrix ones ( num_hid , num_cases ) ; ones.fill( 1.0 ) ;                                // for the logistic function
Matrix ones_2 ( num_cases , num_hid ) ; ones_2.fill( 1.0 ) ;                            // for the logistic function of negdata
Matrix hid_states ( num_cases , num_hid ) ; hid_states.fill( 0.0 ) ;                    // for hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ;
Matrix w_grad ( num_hid , num_dims ) ; w_grad.fill( 0.0 ) ;                             // for w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
Matrix bi_grad ( num_dims , 1 ) ; bi_grad.fill( 0.0 ) ;                                 // for bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
Matrix bj_grad ( num_hid , 1 ) ; bj_grad.fill( 0.0 ) ;                                  // for bj_grad = sum( hid_states , 1 )' ;
Matrix topdown ( num_cases , num_dims ) ; topdown.fill( 0.0 ) ;                         // for topdown = gsd .* ( hid_states * w ) ;
Matrix negdata ( num_cases , num_dims ) ; negdata.fill( 0.0 ) ;
Matrix negdata_transpose ( num_dims , num_cases ) ; negdata_transpose.fill( 0.0 ) ;
Matrix bi_transpose ( 1 , num_dims ) ; bi_transpose.fill( 0.0 ) ;
Matrix repmat_bi_transpose ( num_cases , num_dims ) ; repmat_bi_transpose.fill( 0.0 ) ; 
Matrix neg_w_grad ( num_hid , num_dims ) ; neg_w_grad.fill( 0.0 ) ;
Matrix neg_bi_grad ( num_dims , 1 ) ; neg_bi_grad.fill( 0.0 ) ;                         // for neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
Matrix neg_bj_grad ( num_hid , 1 ) ; neg_bj_grad.fill( 0.0 ) ;                          // for neg_bj_grad = sum( h_posteriors , 2 ) ;
  
// Setting learning rates and create some utility matrices
Matrix epsilon_w ( num_hid , num_dims ) ; epsilon_w.fill( 0.001 ) ;   // undirected
Matrix epsilon_bi ( num_dims , 1 ) ; epsilon_bi.fill( 0.001 ) ;       // visibles
Matrix epsilon_bj ( num_hid , 1 ) ; epsilon_bj.fill( 0.001 ) ;        // hidden units
Matrix epsilon_A ( num_dims , num_dims ) ; epsilon_A.fill( 0.001 ) ;  // autoregressive
Matrix epsilon_B ( num_hid , num_dims ) ; epsilon_B.fill( 0.001 ) ;   // prev visibles to hidden
Matrix w_decay ( num_hid , num_dims ) ; w_decay.fill( 0.0002 ) ;      // currently we use the same weight decay for w, A, B
Matrix w_decay_A ( num_dims , num_dims ) ; w_decay_A.fill( 0.0002 ) ; // currently we use the same weight decay for w, A, B
Matrix w_decay_B ( num_hid , num_dims ) ; w_decay_B.fill( 0.0002 ) ;  // currently we use the same weight decay for w, A, B

Matrix momentum_w ( num_hid , num_dims ) ; momentum_w.fill( 0.0 ) ;   // momentum used only after 5 epochs of training, when it will be set to 0.9
Matrix num_cases_matrices_w_and_B ( num_hid , num_dims ) ; num_cases_matrices_w_and_B.fill( num_cases ) ;

Matrix momentum_bi ( num_dims , 1 ) ; momentum_bi.fill( 0.0 ) ;
Matrix num_cases_matrix_bi ( num_dims , 1 ) ; num_cases_matrix_bi.fill( num_cases ) ;

Matrix momentum_bj ( num_hid , 1 ) ; momentum_bj.fill( 0.0 ) ;
Matrix num_cases_matrix_bj ( num_hid , 1 ) ; num_cases_matrix_bj.fill( num_cases ) ;

Matrix momentum_A ( num_dims , num_dims ) ; momentum_A.fill( 0.0 ) ;
Matrix num_cases_matrix_A ( num_dims , num_dims ) ; num_cases_matrix_A.fill( num_cases ) ;

Matrix momentum_B ( num_hid , num_dims ) ; momentum_B.fill( 0.0 ) ;

// initialization of output matrices
Matrix w ( num_hid , num_dims ) ; w.fill( 0.0 ) ; // Octave code ---> w = 0.01 * randn( num_hid , num_dims ) ;
Matrix bi ( num_dims , 1 ) ; bi.fill( 0.0 ) ;     // Octave code ---> bi = 0.01 * randn( num_dims , 1 ) ;
Matrix bj( num_hid , 1 ) ; bj.fill( 0.0 ) ;       // Octave code ---> bj = -1 + 0.01 * randn( num_hid , 1 ) ; // set to favour units being "off"

// The autoregressive weights; A( : , : , j ) is the weight from t-j to the visible
NDArray A ( dim_vector( num_dims , num_dims , nt ) ) ; A.fill( 0.0 ) ; // Octave code ---> A = 0.01 * randn( num_dims ,num_dims , nt ) ;

// The weights from previous time-steps to the hiddens; B( : , : , j ) is the weight from t-j to the hidden layer
NDArray B ( dim_vector( num_hid , num_dims , nt ) ) ; B.fill( 0.0 ) ;  // Octave code ---> B = 0.01 * randn( num_hid , num_dims , nt ) ;

// Declare MersenneTwister random values
MTRand mtrand1 ;
double rand_norm_value ;
double rand_uniform_value ;

// nested loops to fill w, bi, bj, A and B with initial random values
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_hid ; ii_nest_loop++ )
    { 
    rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
    bj ( ii_nest_loop , 0 ) = -1.0 + rand_norm_value ; // set to favour units being "off"

     for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
         {
         rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
         w ( ii_nest_loop , jj_nest_loop ) = rand_norm_value ;
         }
    }
    
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_dims ; ii_nest_loop++ )
    {
    rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
    bi ( ii_nest_loop , 0 ) = rand_norm_value ;
    }

for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
    {
      for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_dims ; ii_nest_loop++ )
          {
           for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims  ; jj_nest_loop++ )
               {
               rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
               A ( ii_nest_loop , jj_nest_loop , hh ) = rand_norm_value ;
               }  // end of jj_nest_loop loop      
          }       // end of ii_nest_loop loop
    }             // end of hh loop
    
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
    {
      for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_hid ; ii_nest_loop++ )
          {
           for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims  ; jj_nest_loop++ )
               {
               rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
               B ( ii_nest_loop , jj_nest_loop , hh ) = rand_norm_value ;
               }  // end of jj_nest_loop loop      
          }       // end of ii_nest_loop loop
    }             // end of hh loop        
    
// keep previous updates around for momentum
Matrix w_update ( num_hid , num_dims ) ; w_update.fill( 0.0 ) ; // Octave code ---> w_update = zeros( size( w ) ) ;
Matrix bi_update ( num_dims , 1 ) ; bi_update.fill( 0.0 ) ;     // Octave code ---> bi_update = zeros( size( bi ) ) ;
Matrix bj_update ( num_hid , 1 ) ; bj_update.fill( 0.0 ) ;      // Octave code ---> bj_update = zeros( size( bj ) ) ;

NDArray A_update ( dim_vector( num_dims , num_dims , nt ) ) ; A_update.fill( 0.0 ) ; // Octave code ---> A_update = zeros( size( A ) ) ;
Matrix A_extract ( num_dims , num_dims ) ; A_extract.fill( 0.0 ) ;                   // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix A_update_extract ( num_dims , num_dims ) ; A_update_extract.fill( 0.0 ) ;     // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray B_update ( dim_vector( num_hid , num_dims , nt ) ) ; B_update.fill( 0.0 ) ;  // Octave code ---> B_update = zeros( size( B ) ) ;
Matrix B_extract ( num_hid , num_dims ) ; B_extract.fill( 0.0 ) ;                    // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix B_update_extract ( num_hid , num_dims ) ; B_update_extract.fill( 0.0 ) ;      // matrix for intermediate calculations because cannot directly "slice" into a NDArray

// data is a nt+1 dimensional array with current and delayed data corresponding to mini-batches
// num_cases = minibatch( batch ).length () ;
num_cases = minibatch.rows () ;
NDArray data ( dim_vector( num_cases , num_dims , nt + 1 ) ) ; data.fill( 0.0 ) ; // Octave code ---> data = zeros( num_cases , num_dims , nt + 1 ) ;
Matrix data_extract ( num_cases , num_dims ) ; data_extract.fill( 0.0 ) ;         // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_transpose ( num_dims , num_cases ) ; data_transpose.fill( 0.0 ) ;     // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_0 ( num_cases , num_dims ) ; data_0.fill( 0.0 ) ;                     // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_0_transpose ( num_dims , num_cases ) ; data_0_transpose.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray A_grad ( dim_vector( num_dims , num_dims , nt ) ) ; A_grad.fill( 0.0 ) ;  // for A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;                                    
Matrix A_grad_extract ( num_dims , num_dims ) ; A_grad_extract.fill( 0.0 ) ;      // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray neg_A_grad ( dim_vector( num_dims , num_dims , nt ) ) ; neg_A_grad.fill( 0.0 ) ; // for neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
Matrix neg_A_grad_extract ( num_dims , num_dims ) ; neg_A_grad_extract.fill( 0.0 ) ;     // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray B_grad ( dim_vector( num_hid , num_dims , nt ) ) ; B_grad.fill( 0.0 ) ;          // for B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
Matrix B_grad_extract ( num_hid , num_dims ) ; B_grad_extract.fill( 0.0 ) ;              // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray neg_B_grad ( dim_vector( num_hid , num_dims , nt ) ) ; neg_B_grad.fill( 0.0 ) ;  // for neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;
Matrix neg_B_grad_extract ( num_hid , num_dims ) ; neg_B_grad_extract.fill( 0.0 ) ;      // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

Array <octave_idx_type> p ( dim_vector ( 3 , 1 ) , 0 ) ;               // position vector for writing to A_grad and B_grad ( one entry per NDArray dimension )

// end of initialization of matrices

// %%%%%%%%% THE MAIN CODE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
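// Note: each pass through the batch loop below performs one step of contrastive divergence ( CD-1 ):
// a data-driven positive phase, a single reconstruction of the visible units, a reconstruction-driven
// negative phase, and finally the weight/bias updates with momentum and weight decay.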

for ( octave_idx_type epoch ( 0 ) ; epoch < num_epochs ; epoch++ ) // main epoch loop
    { 
//     // errsum = 0 ; % keep a running total of the difference between data and recon
//       
 for ( octave_idx_type batch ( 0 ) ; batch < minibatch.cols () ; batch++ ) // Octave code ---> int num_batches = minibatch.length () ;
     {
// %%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%      

// // These next nested loops fill the data NDArray with the values from batchdata indexed
// // by the column values of minibatch. Octave code equivalent given below:-
// // Octave code ---> mb = minibatch{ batch } ; % caches the indices   
// // Octave code ---> data( : , : , 1 ) = batchdata( mb , : ) ;
// // Octave code ---> for hh = 1 : nt
// // Octave code ---> data( : , : , hh + 1 ) = batchdata( mb - hh , : ) ;
// // Octave code ---> end
     for ( octave_idx_type hh ( 0 ) ; hh < nt + 1 ; hh++ )
         {
         for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_cases ; ii_nest_loop++ )
             {
             for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
                 {
                 data( ii_nest_loop , jj_nest_loop , hh ) = batchdata( minibatch( ii_nest_loop , batch ) - 1 - hh , jj_nest_loop ) ; // -1 for .oct zero based indexing vs Octave's 1 based
                 } // end of jj_nest_loop loop
             }     // end of ii_nest_loop loop
         }         // end of hh loop

// The above data filling loop could perhaps be implemented more quickly using the fortran_vec method as below
// http://stackoverflow.com/questions/28900153/create-a-ndarray-in-an-oct-file-from-a-double-pointer
// NDArray a (dim_vector(dim[0], dim[1], dim[2]));
// 
// Then loop over (i, j, k) indices to copy the cube to the octave array
// 
// double* a_vec = a.fortran_vec ();
// for (int i = 0; i < dim[0]; i++) {
//     for (int j = 0; j < dim[1]; j++) {
//         for (int k = 0; k < dim[2]; k++) {
//             *a_vec++ = armadillo_cube(i, j, k);
//         }
//     }
// }

// calculate contributions from directed autoregressive connections and contributions from directed visible-to-hidden connections
            bi_star.fill( 0.0 ) ; // Octave code ---> bi_star = zeros( num_dims , num_cases ) ; ( matrix declared earlier in code above )
            bj_star.fill( 0.0 ) ; // Octave code ---> bj_star = zeros( num_hid , num_cases ) ; ( matrix declared earlier in code above )
            
            // The code below calculates two separate Octave code loops in one nested C++ loop structure, namely
            // Octave code ---> for hh = 1 : nt       
            //                      bi_star = bi_star +  A( : , : , hh ) * data( : , : , hh + 1 )' ;
            //                  end 
            // and
            // Octave code ---> for hh = 1:nt
            //                      bj_star = bj_star + B( : , : , hh ) * data( : , : , hh + 1 )' ;
            //                  end 
            
            for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
                {
                // fill the intermediate calculation matrices 
                A_extract = A.page ( hh ) ;
                B_extract = B.page ( hh ) ;
                data_transpose = ( data.page ( hh + 1 ) ).transpose () ;

                // add up the hh different matrix multiplications
                bi_star += A_extract * data_transpose ; 
                bj_star += B_extract * data_transpose ;
     
         } // end of hh loop
         
                // extract and pre-calculate to save time in later computations
                data_0 = data.page ( 0 ) ;
                data_0_transpose = data_0.transpose () ;
 
// Calculate "posterior" probability -- hidden state being on ( Note that it isn't a true posterior )   
//          Octave code ---> eta = w * ( data( : , : , 1 ) ./ gsd )' + ... % bottom-up connections
//                                 repmat( bj , 1 , num_cases ) + ...      % static biases on unit
//                                 bj_star ;                               % dynamic biases

//          get repmat( bj , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
            for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_cases ; jj_nest_loop++ ) // loop over the columns
                {
                repmat_bj.insert( bj , 0 , jj_nest_loop ) ;
                }

            // bottom_up = w * data( : , : , 1 )' ;
            eta = w * data_0_transpose + repmat_bj + bj_star ;       
 
            // h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ;    % logistic
            // -exponate the eta matrix
            for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
                {
                for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
                    {
                    eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
                    }
                }

            // element division  A./B == quotient(A,B) 
            h_posteriors = quotient( ones , ( ones + eta ) ) ;

// Activate the hidden units    
//          Octave code ---> hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ;
     for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < hid_states.rows () ; ii_nest_loop++ )
         {
         for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < hid_states.cols () ; jj_nest_loop++ )
             {
             rand_uniform_value = mtrand1.randDblExc() ; // a real number in the range 0 to 1, excluding both 0 and 1
             hid_states( ii_nest_loop , jj_nest_loop ) = h_posteriors( jj_nest_loop , ii_nest_loop ) > rand_uniform_value ? 1.0 : 0.0 ;
             }
         } // end of hid_states loop

// Calculate positive gradients ( note w.r.t. neg energy )
// Octave code ---> w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
//                  bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
//                  bj_grad = sum( hid_states , 1 )' ;
            w_grad = hid_states.transpose () * data_0 ;

//          get repmat( bi , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
            for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_cases ; jj_nest_loop++ ) // loop over the columns
                {
                repmat_bi.insert( bi , 0 , jj_nest_loop ) ; 
                }  
            bi_grad = ( data_0_transpose - repmat_bi - bi_star ).sum ( 1 ) ;
            bj_grad = ( hid_states.sum ( 0 ) ).transpose () ;
    
// Octave code ---> for hh = 1 : nt      
//                  A_grad( : , : , hh ) = ( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
//                  B_grad( : , : , hh ) = hid_states' * data( : , : , hh + 1 ) ;      
//                  end
            for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
                {
                p( 2 ) = hh ;                                                                    // set Array  p to write to page hh of A_grad and B_grad NDArrays
                data_extract = data.page ( hh + 1 ) ;                                            // get the equivalent of data( : , : , hh + 1 )
                A_grad.insert( ( data_0_transpose - repmat_bi - bi_star ) * data_extract , p ) ;
                B_grad.insert( hid_states.transpose () * data_extract , p ) ;
                }
// the above code comes from http://stackoverflow.com/questions/29572075/how-do-you-create-an-arrayoctave-idx-type-in-an-oct-file
// e.g.
//
// Array<octave_idx_type> p (dim_vector (3, 1));
// int n = 2;
// dim_vector dim(n, n, 3);
// NDArray a_matrix(dim);
// 
// for (octave_idx_type i = 0; i < n; i++)
//   for (octave_idx_type j = 0; j < n; j++)
//     a_matrix(i,j, 1) = (i + 1) * 10 + (j + 1);
// 
// std::cout << a_matrix;
// 
// Matrix b_matrix = Matrix (n, n);
// b_matrix(0, 0) = 1; 
// b_matrix(0, 1) = 2; 
// b_matrix(1, 0) = 3; 
// b_matrix(1, 1) = 4; 
// std::cout << b_matrix;
// 
// Array<octave_idx_type> p (dim_vector (3, 1), 0);
// p(2) = 2;
// a_matrix.insert (b_matrix, p);
// 
// std::cout << a_matrix;

// %%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  
 
// Activate the visible units
// Octave code ---> topdown = hid_states * w ;
   topdown = hid_states * w ;
    
// get repmat( bi' , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
   bi_transpose = bi.transpose () ;
   for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_cases ; ii_nest_loop++ ) // loop over the rows
       {
       repmat_bi_transpose.insert( bi_transpose , ii_nest_loop , 0 ) ; 
       }
       
   eta = topdown + repmat_bi_transpose + bi_star.transpose () ;

// Octave code ---> negdata = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
// -exponate the eta matrix
   for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
       { 
        for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
            {
            eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
            }
       }   
  
// element division  A./B == quotient(A,B) 
   negdata = quotient( ones_2 , ( ones_2 + eta ) ) ;   
   
// Now conditional on negdata, calculate "posterior" probability for hiddens
// Octave code ---> eta = w * ( negdata ./ gsd )' + ...      % bottom-up connections
//                        repmat( bj , 1 , num_cases ) + ... % static biases on unit (no change)
//                        bj_star ;                          % dynamic biases (no change)

   negdata_transpose = negdata.transpose () ;             // to save repetition of transpose
   eta = w * negdata_transpose + repmat_bj + bj_star ;
   
// h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ;    % logistic
// -exponate the eta matrix
   for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
       { 
        for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
            {
            eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
            }
       }

// element division  A./B == quotient(A,B) 
   h_posteriors = quotient( ones , ( ones + eta ) ) ;

// Calculate negative gradients
// Octave code ---> neg_w_grad = h_posteriors * ( negdata ./ gsd ) ; % not using activations
   neg_w_grad = h_posteriors * negdata ; // not using activations
   
// Octave code ---> neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
   neg_bi_grad = ( negdata_transpose - repmat_bi - bi_star ).sum ( 1 ) ;
   
// Octave code ---> neg_bj_grad = sum( h_posteriors , 2 ) ;
   neg_bj_grad = h_posteriors.sum ( 1 ) ;
   
// Octave code ---> for hh = 1 : nt
//                      neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
//                      neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;        
//                  end
   
   for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
       {
       p( 2 ) = hh ;                                                                            // set Array  p to write to page hh of A_grad and B_grad NDArrays
       
       data_extract = data.page ( hh + 1 ) ;                                                    // get the equivalent of data( : , : , hh + 1 )
       neg_A_grad.insert( ( negdata_transpose - repmat_bi - bi_star ) * data_extract , p ) ;
       neg_B_grad.insert( h_posteriors * data_extract , p ) ;
   
       } // end of hh loop   

// %%%%%%%%% END NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

// Octave code ---> err = sum( sum( ( data( : , : , 1 ) - negdata ) .^2 ) ) ;
// Not used         errsum = err + errsum ;

// Octave code ---> if ( epoch > 5 ) % use momentum
//                  momentum = mom ;
//                  else % no momentum
//                  momentum = 0 ;
//                  end

   // momentum was initialised to 0.0, but on the 6th iteration of epoch, set momentum to 0.9
   if ( epoch == 5 ) // will only be true once, after which momentum will == 0.9
      {
        momentum_w.fill( 0.9 ) ;
        momentum_bi.fill( 0.9 ) ;
        momentum_bj.fill( 0.9 ) ;
        momentum_A.fill( 0.9 ) ;
        momentum_B.fill( 0.9 ) ;
      }
     
// %%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%        

// Octave code ---> w_update = momentum * w_update + epsilon_w * ( ( w_grad - neg_w_grad ) / num_cases - w_decay * w ) ;
   w_update = product( momentum_w , w_update ) + product( epsilon_w , quotient( ( w_grad - neg_w_grad ) , num_cases_matrices_w_and_B ) - product( w_decay , w ) ) ;
   
// Octave code ---> bi_update = momentum * bi_update + ( epsilon_bi / num_cases ) * ( bi_grad - neg_bi_grad ) ;
   bi_update = product( momentum_bi , bi_update ) + product( quotient( epsilon_bi , num_cases_matrix_bi ) , ( bi_grad - neg_bi_grad ) ) ;
   
// Octave code ---> bj_update = momentum * bj_update + ( epsilon_bj / num_cases ) * ( bj_grad - neg_bj_grad ) ;
   bj_update = product( momentum_bj , bj_update ) + product( quotient( epsilon_bj , num_cases_matrix_bj ) , ( bj_grad - neg_bj_grad ) ) ;
 
// The following two Octave code loops are combined into the single .oct loop that follows them
//   
// Octave code ---> for hh = 1 : nt
//                      A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;
//                      B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
//                  end
   
// Octave code ---> for hh = 1 : nt
//                      A( : , : , hh ) = A( : , : , hh ) + A_update( : , : , hh ) ;
//                      B( : , : , hh ) = B( : , : , hh ) + B_update( : , : , hh ) ;
//                  end   

   for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
       {
       p( 2 ) = hh ;
   
       A_update_extract = A_update.page ( hh ) ;
       A_grad_extract = A_grad.page ( hh ) ;
       neg_A_grad_extract = neg_A_grad.page ( hh ) ;
       A_extract = A.page ( hh ) ;
       A_update.insert( product( momentum_A , A_update_extract ) + product( epsilon_A , ( quotient( ( A_grad_extract - neg_A_grad_extract ) , num_cases_matrix_A ) - product( w_decay_A , A_extract ) ) ) , p ) ;
       A_update_extract = A_update.page ( hh ) ;
       A.insert( A_extract + A_update_extract , p ) ;
   
       B_update_extract = B_update.page ( hh ) ;
       B_grad_extract = B_grad.page ( hh ) ;
       neg_B_grad_extract = neg_B_grad.page ( hh ) ;
       B_extract = B.page ( hh ) ;
       B_update.insert( product( momentum_B , B_update_extract ) + product( epsilon_B , ( quotient( ( B_grad_extract - neg_B_grad_extract ) , num_cases_matrices_w_and_B ) - product( w_decay_B , B_extract ) ) ) , p ) ;
       B_update_extract = B_update.page ( hh ) ;
       B.insert( B_extract + B_update_extract , p ) ;
       
       } // end of hh loop
  
// Octave code ---> w = w + w_update ;
//                  bi = bi + bi_update ;
//                  bj = bj + bj_update ;

   w += w_update ;
   bi += bi_update ;
   bj += bj_update ;
       
// %%%%%%%%%%%%%%%% END OF UPDATES  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

            } // end of batch loop

    } // end of main epoch loop

// return the values w , bj , bi , A , B
retval_list(4) = B ;
retval_list(3) = A ;
retval_list(2) = bi ;    
retval_list(1) = bj ;
retval_list(0) = w ;

return retval_list ;
  
} // end of function

Thursday, 16 April 2015

Optimised CRBM Code for Gaussian Units

Over the last few weeks I have been working on optimising the conditional restricted Boltzmann machine code, with a view to speeding it up via a C++ .oct file, and the code box below contains this .oct code for the gaussian_crbm.m function from my previous post. This gaussian_crbm.m function, plus the binary_crbm.m one, are the speed bottlenecks whilst training the CRBM. In the code below I have made some adjustments for code simplification purposes, the most important of which are:
  • the minibatch index is a matrix rather than a cell type index, with the individual batch indices in the columns, which means that the batches are all equal in length.
  • as a result, all the data held in cell arrays in the .m function is held in NDArrays in the .oct function.
  • there is no input for the variable "gsd" because I have assumed a value of 1 applies. This means the input data must be z-score normalised prior to the function call, as in the usage sketch below.
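Assuming the file has been compiled with mkoctfile, and that the return order matches the retval_list assignments at the end of the binary units listing above ( w , bj , bi , A , B ), a minimal and purely illustrative usage sketch from the Octave prompt, where raw_data is a hypothetical matrix of input features, would be:

batchdata = ( raw_data - mean( raw_data ) ) ./ std( raw_data ) ; % z-score normalise because gsd is assumed to be 1
nt = 3 ; num_epochs = 100 ; num_hid = 50 ;
% each column of minibatch holds the batchdata row indices of one mini-batch;
% all indices must exceed nt so that the nt delayed rows exist
minibatch = reshape( ( nt + 1 ) : ( nt + 100 ) , 25 , 4 ) ;
[ w , bj , bi , A , B ] = cc_gaussian_crbm_mersenne( batchdata , minibatch , nt , num_epochs , num_hid ) ;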
DEFUN_DLD ( cc_gaussian_crbm_mersenne , args , nargout ,
"-*- texinfo -*-\n\
@deftypefn {Function File} {} cc_gaussian_crbm_mersenne (@var{ batchdata , minibatch , nt , num_epochs , num_hid })\n\n\
This function trains a real valued, gaussian crbm where the gsd is assumed to be 1, so the batchdata input must be z-score normalised.\n\
The value nt is the order of the model, i.e. how many previous values should be included, num_epochs is the number of training epochs,\n\
and num_hid is the number of nodes in the hidden layer.\n\
@end deftypefn" )

{
octave_value_list retval_list ;
int nargin = args.length () ;

// check the input arguments
if ( nargin != 5 )
   {
   error ( "Input error: there are insufficient input arguments. Type help for more details." ) ;
   return retval_list ;
   }

if ( !args(0).is_real_matrix () ) // check batchdata
   {
   error ( "Input error: the 1st argument, batchdata, is not a matrix type. Type help for more details." ) ;
   return retval_list ;
   }

if ( !args(1).is_real_matrix () ) // check minibatch
   {
   error ( "Input error: the 2nd argument, minibatch, is not a matrix type. Type help for more details." ) ;
   return retval_list ;
   }   
   
if ( args(2).length () != 1 ) // check nt
   {
   error ( "Input error: nt should be an interger value for the 'order' of the model. Type help for more details." ) ;
   return retval_list ;
   }
   
if ( args(3).length () != 1 ) // num_epochs
   {
   error ( "Input error: num_epochs should be an integer value for the number of training epochs. Type help for more details." ) ;
   return retval_list ;
   }
   
if ( args(4).length () != 1 ) // check num_hid
   {
   error ( "Input error: num_hid should be an integer for the number of nodes in hidden layer. Type help for more details." ) ;
   return retval_list ;
   }
     
if ( error_state )
   {
   error ( "Input error: type help for details." ) ;
   return retval_list ;
   }
// end of input checking
  
// inputs
Matrix batchdata = args(0).matrix_value () ;
Matrix minibatch = args(1).matrix_value () ;
int nt = args(2).int_value () ; // the "order" of the model
int num_epochs = args(3).int_value () ;
int num_hid = args(4).int_value () ;

// variables
// batchdata is a big matrix of all the data and we index it with "minibatch", a matrix of mini-batch indices in the columns                       
int num_cases = minibatch.rows () ;                                                     // Octave code ---> num_cases = length( minibatch{ batch } ) ;
int num_dims = batchdata.cols () ;                                                      // visible dimension
Matrix bi_star ( num_dims , num_cases ) ; bi_star.fill( 0.0 ) ;                         // Octave code ---> bi_star = zeros( num_dims , num_cases ) ; 
Matrix bj_star ( num_hid , num_cases ) ; bj_star.fill( 0.0 ) ;                          // Octave code ---> bj_star = zeros( num_hid , num_cases ) ; 
Matrix repmat_bj ( num_hid , num_cases ) ; repmat_bj.fill( 0.0 ) ;                      // for Octave code ---> repmat( bj , 1 , num_cases )
Matrix repmat_bi ( num_dims , num_cases ) ; repmat_bi.fill( 0.0 ) ;                     // for Octave code ---> repmat( bi , 1 , num_cases )
Matrix eta ( num_hid , num_cases ) ; eta.fill( 0.0 ) ;
Matrix h_posteriors ( num_hid , num_cases ) ; h_posteriors.fill( 0.0 ) ;                // for the logistic function
Matrix ones ( num_hid , num_cases ) ; ones.fill( 1.0 ) ;                                // for the logistic function
Matrix hid_states ( num_cases , num_hid ) ; hid_states.fill( 0.0 ) ;                    // for hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ;
Matrix w_grad ( num_hid , num_dims ) ; w_grad.fill( 0.0 ) ;                             // for w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
Matrix bi_grad ( num_dims , 1 ) ; bi_grad.fill( 0.0 ) ;                                 // for bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
Matrix bj_grad ( num_hid , 1 ) ; bj_grad.fill( 0.0 ) ;                                  // for bj_grad = sum( hid_states , 1 )' ;
Matrix topdown ( num_cases , num_dims ) ; topdown.fill( 0.0 ) ;                         // for topdown = gsd .* ( hid_states * w ) ;
Matrix negdata ( num_cases , num_dims ) ; negdata.fill( 0.0 ) ;
Matrix negdata_transpose ( num_dims , num_cases ) ; negdata_transpose.fill( 0.0 ) ;
Matrix bi_transpose ( 1 , num_dims ) ; bi_transpose.fill( 0.0 ) ;
Matrix repmat_bi_transpose ( num_cases , num_dims ) ; repmat_bi_transpose.fill( 0.0 ) ; 
Matrix neg_w_grad ( num_hid , num_dims ) ; neg_w_grad.fill( 0.0 ) ;
Matrix neg_bi_grad ( num_dims , 1 ) ; neg_bi_grad.fill( 0.0 ) ;                         // for neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
Matrix neg_bj_grad ( num_hid , 1 ) ; neg_bj_grad.fill( 0.0 ) ;                          // for neg_bj_grad = sum( h_posteriors , 2 ) ;
  
// Setting learning rates and create some utility matrices
Matrix epsilon_w ( num_hid , num_dims ) ; epsilon_w.fill( 0.001 ) ;   // undirected
Matrix epsilon_bi ( num_dims , 1 ) ; epsilon_bi.fill( 0.001 ) ;       // visibles
Matrix epsilon_bj ( num_hid , 1 ) ; epsilon_bj.fill( 0.001 ) ;        // hidden units
Matrix epsilon_A ( num_dims , num_dims ) ; epsilon_A.fill( 0.001 ) ;  // autoregressive
Matrix epsilon_B ( num_hid , num_dims ) ; epsilon_B.fill( 0.001 ) ;   // prev visibles to hidden
Matrix w_decay ( num_hid , num_dims ) ; w_decay.fill( 0.0002 ) ;      // currently we use the same weight decay for w, A, B
Matrix w_decay_A ( num_dims , num_dims ) ; w_decay_A.fill( 0.0002 ) ; // currently we use the same weight decay for w, A, B
Matrix w_decay_B ( num_hid , num_dims ) ; w_decay_B.fill( 0.0002 ) ;  // currently we use the same weight decay for w, A, B

Matrix momentum_w ( num_hid , num_dims ) ; momentum_w.fill( 0.0 ) ;   // momentum used only after 5 epochs of training, when it will be set to 0.9
Matrix num_cases_matrices_w_and_B ( num_hid , num_dims ) ; num_cases_matrices_w_and_B.fill( num_cases ) ;

Matrix momentum_bi ( num_dims , 1 ) ; momentum_bi.fill( 0.0 ) ;
Matrix num_cases_matrix_bi ( num_dims , 1 ) ; num_cases_matrix_bi.fill( num_cases ) ;

Matrix momentum_bj ( num_hid , 1 ) ; momentum_bj.fill( 0.0 ) ;
Matrix num_cases_matrix_bj ( num_hid , 1 ) ; num_cases_matrix_bj.fill( num_cases ) ;

Matrix momentum_A ( num_dims , num_dims ) ; momentum_A.fill( 0.0 ) ;
Matrix num_cases_matrix_A ( num_dims , num_dims ) ; num_cases_matrix_A.fill( num_cases ) ;

Matrix momentum_B ( num_hid , num_dims ) ; momentum_B.fill( 0.0 ) ;

// initialization of output matrices
Matrix w ( num_hid , num_dims ) ; w.fill( 0.0 ) ; // Octave code ---> w = 0.01 * randn( num_hid , num_dims ) ;
Matrix bi ( num_dims , 1 ) ; bi.fill( 0.0 ) ;     // Octave code ---> bi = 0.01 * randn( num_dims , 1 ) ;
Matrix bj( num_hid , 1 ) ; bj.fill( 0.0 ) ;       // Octave code ---> bj = -1 + 0.01 * randn( num_hid , 1 ) ; // set to favour units being "off"

// The autoregressive weights; A( : , : , j ) is the weight from t-j to the visible
NDArray A ( dim_vector( num_dims , num_dims , nt ) ) ; A.fill( 0.0 ) ; // Octave code ---> A = 0.01 * randn( num_dims ,num_dims , nt ) ;

// The weights from previous time-steps to the hiddens; B( : , : , j ) is the weight from t-j to the hidden layer
NDArray B ( dim_vector( num_hid , num_dims , nt ) ) ; B.fill( 0.0 ) ;  // Octave code ---> B = 0.01 * randn( num_hid , num_dims , nt ) ;

// Declare MersenneTwister random values
MTRand mtrand1 ;
double rand_norm_value ;
double rand_uniform_value ;

// nested loops to fill w, bi, bj, A and B with initial random values
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_hid ; ii_nest_loop++ )
    { 
    rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
    bj ( ii_nest_loop , 0 ) = -1.0 + rand_norm_value ; // set to favour units being "off"

     for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
         {
         rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
         w ( ii_nest_loop , jj_nest_loop ) = rand_norm_value ;
         }
    }
    
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_dims ; ii_nest_loop++ )
    {
    rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
    bi ( ii_nest_loop , 0 ) = rand_norm_value ;
    }

for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
    {
      for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_dims ; ii_nest_loop++ )
          {
           for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims  ; jj_nest_loop++ )
               {
               rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
               A ( ii_nest_loop , jj_nest_loop , hh ) = rand_norm_value ;
               }  // end of jj_nest_loop loop      
          }       // end of ii_nest_loop loop
    }             // end of hh loop
    
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
    {
      for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_hid ; ii_nest_loop++ )
          {
           for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims  ; jj_nest_loop++ )
               {
               rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
               B ( ii_nest_loop , jj_nest_loop , hh ) = rand_norm_value ;
               }  // end of jj_nest_loop loop      
          }       // end of ii_nest_loop loop
    }             // end of hh loop        
    
// keep previous updates around for momentum
Matrix w_update ( num_hid , num_dims ) ; w_update.fill( 0.0 ) ; // Octave code ---> w_update = zeros( size( w ) ) ;
Matrix bi_update ( num_dims , 1 ) ; bi_update.fill( 0.0 ) ;     // Octave code ---> bi_update = zeros( size( bi ) ) ;
Matrix bj_update ( num_hid , 1 ) ; bj_update.fill( 0.0 ) ;      // Octave code ---> bj_update = zeros( size( bj ) ) ;

NDArray A_update ( dim_vector( num_dims , num_dims , nt ) ) ; A_update.fill( 0.0 ) ; // Octave code ---> A_update = zeros( size( A ) ) ;
Matrix A_extract ( num_dims , num_dims ) ; A_extract.fill( 0.0 ) ;                   // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix A_update_extract ( num_dims , num_dims ) ; A_update_extract.fill( 0.0 ) ;     // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray B_update ( dim_vector( num_hid , num_dims , nt ) ) ; B_update.fill( 0.0 ) ;  // Octave code ---> B_update = zeros( size( B ) ) ;
Matrix B_extract ( num_hid , num_dims ) ; B_extract.fill( 0.0 ) ;                    // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix B_update_extract ( num_hid , num_dims ) ; B_update_extract.fill( 0.0 ) ;      // matrix for intermediate calculations because cannot directly "slice" into a NDArray

// data is a nt+1 dimensional array with current and delayed data corresponding to mini-batches
// num_cases = minibatch( batch ).length () ;
num_cases = minibatch.rows () ;
NDArray data ( dim_vector( num_cases , num_dims , nt + 1 ) ) ; data.fill( 0.0 ) ; // Octave code ---> data = zeros( num_cases , num_dims , nt + 1 ) ;
Matrix data_extract ( num_cases , num_dims ) ; data_extract.fill( 0.0 ) ;         // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_transpose ( num_dims , num_cases ) ; data_transpose.fill( 0.0 ) ;     // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_0 ( num_cases , num_dims ) ; data_0.fill( 0.0 ) ;                     // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_0_transpose ( num_dims , num_cases ) ; data_0_transpose.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray A_grad ( dim_vector( num_dims , num_dims , nt ) ) ; A_grad.fill( 0.0 ) ;  // for A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;                                    
Matrix A_grad_extract ( num_dims , num_dims ) ; A_grad_extract.fill( 0.0 ) ;      // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray neg_A_grad ( dim_vector( num_dims , num_dims , nt ) ) ; neg_A_grad.fill( 0.0 ) ; // for neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
Matrix neg_A_grad_extract ( num_dims , num_dims ) ; neg_A_grad_extract.fill( 0.0 ) ;     // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray B_grad ( dim_vector( num_hid , num_dims , nt ) ) ; B_grad.fill( 0.0 ) ;          // for B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
Matrix B_grad_extract ( num_hid , num_dims ) ; B_grad_extract.fill( 0.0 ) ;              // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

NDArray neg_B_grad ( dim_vector( num_hid , num_dims , nt ) ) ; neg_B_grad.fill( 0.0 ) ;  // for neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;
Matrix neg_B_grad_extract ( num_hid , num_dims ) ; neg_B_grad_extract.fill( 0.0 ) ;      // matrix for intermediate calculations because cannot directly "slice" into a NDArrays

Array <octave_idx_type> p ( dim_vector ( 3 , 1 ) , 0 ) ;               // position vector for writing to A_grad and B_grad ( one entry per NDArray dimension )

// end of initialization of matrices

// %%%%%%%%% THE MAIN CODE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
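// Note: as in the binary units listing above, each pass through the batch loop below performs one
// step of contrastive divergence ( CD-1 ), here with the mean of the Gaussian used as the
// reconstruction of the visible units.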

for ( octave_idx_type epoch ( 0 ) ; epoch < num_epochs ; epoch++ ) // main epoch loop
    { 
//     // errsum = 0 ; % keep a running total of the difference between data and recon
//       
 for ( octave_idx_type batch ( 0 ) ; batch < minibatch.cols () ; batch++ ) // Octave code ---> int num_batches = minibatch.length () ;
     {
// %%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%      

// // These next nested loops fill the data NDArray with the values from batchdata indexed
// // by the column values of minibatch. Octave code equivalent given below:-
// // Octave code ---> mb = minibatch{ batch } ; % caches the indices   
// // Octave code ---> data( : , : , 1 ) = batchdata( mb , : ) ;
// // Octave code ---> for hh = 1 : nt
// // Octave code ---> data( : , : , hh + 1 ) = batchdata( mb - hh , : ) ;
// // Octave code ---> end
     for ( octave_idx_type hh ( 0 ) ; hh < nt + 1 ; hh++ )
         {
         for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_cases ; ii_nest_loop++ )
             {
             for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
                 {
                 data( ii_nest_loop , jj_nest_loop , hh ) = batchdata( minibatch( ii_nest_loop , batch ) - 1 - hh , jj_nest_loop ) ; // -1 for .oct zero based indexing vs Octave's 1 based
                 } // end of jj_nest_loop loop
             }     // end of ii_nest_loop loop
         }         // end of hh loop

// The above data filling loop could perhaps be implemented more quickly using the fortran_vec method as below
// http://stackoverflow.com/questions/28900153/create-a-ndarray-in-an-oct-file-from-a-double-pointer
// NDArray a (dim_vector(dim[0], dim[1], dim[2]));
// 
// Then loop over (i, j, k) indices to copy the cube to the octave array
// 
// double* a_vec = a.fortran_vec ();
// for (int i = 0; i < dim[0]; i++) {
//     for (int j = 0; j < dim[1]; j++) {
//         for (int k = 0; k < dim[2]; k++) {
//             *a_vec++ = armadillo_cube(i, j, k);
//         }
//     }
// }

// calculate contributions from directed autoregressive connections and contributions from directed visible-to-hidden connections
            bi_star.fill( 0.0 ) ; // Octave code ---> bi_star = zeros( num_dims , num_cases ) ; ( matrix declared earlier in code above )
            bj_star.fill( 0.0 ) ; // Octave code ---> bj_star = zeros( num_hid , num_cases ) ; ( matrix declared earlier in code above )
            
            // The code below calculates two separate Octave code loops in one nested C++ loop structure, namely
            // Octave code ---> for hh = 1 : nt       
            //                      bi_star = bi_star +  A( : , : , hh ) * data( : , : , hh + 1 )' ;
            //                  end 
            // and
            // Octave code ---> for hh = 1:nt
            //                      bj_star = bj_star + B( : , : , hh ) * data( : , : , hh + 1 )' ;
            //                  end 
            
            for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
                {
                // fill the intermediate calculation matrices 
                A_extract = A.page ( hh ) ;
                B_extract = B.page ( hh ) ;
                data_transpose = ( data.page ( hh + 1 ) ).transpose () ;

                // add up the hh different matrix multiplications
                bi_star += A_extract * data_transpose ; 
                bj_star += B_extract * data_transpose ;
     
         } // end of hh loop
         
                // extract and pre-calculate to save time in later computations
                data_0 = data.page ( 0 ) ;
                data_0_transpose = data_0.transpose () ;
                  
// Calculate "posterior" probability -- hidden state being on ( Note that it isn't a true posterior )   
//          Octave code ---> eta = w * ( data( : , : , 1 ) ./ gsd )' + ... % bottom-up connections
//                                 repmat( bj , 1 , num_cases ) + ...      % static biases on unit
//                                 bj_star ;                               % dynamic biases

//          get repmat( bj , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
            for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_cases ; jj_nest_loop++ ) // loop over the columns
                {
                repmat_bj.insert( bj , 0 , jj_nest_loop ) ;
                }

            eta = w * data_0_transpose + repmat_bj + bj_star ;       
 
            // h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ;    % logistic
            // -exponate the eta matrix
            for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
                {
                for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
                    {
                    eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
                    }
                }

            // element division  A./B == quotient(A,B) 
            h_posteriors = quotient( ones , ( ones + eta ) ) ;

// Activate the hidden units    
//          Octave code ---> hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ;
     for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < hid_states.rows () ; ii_nest_loop++ )
         {
         for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < hid_states.cols () ; jj_nest_loop++ )
             {
             rand_uniform_value = mtrand1.randDblExc() ; // a real number in the range 0 to 1, excluding both 0 and 1
             hid_states( ii_nest_loop , jj_nest_loop ) = h_posteriors( jj_nest_loop , ii_nest_loop ) > rand_uniform_value ? 1.0 : 0.0 ;
             }
         } // end of hid_states loop

// Calculate positive gradients ( note w.r.t. neg energy )
// Octave code ---> w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
//                  bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
//                  bj_grad = sum( hid_states , 1 )' ;
            w_grad = hid_states.transpose () * data_0 ;

//          get repmat( bi , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
            for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_cases ; jj_nest_loop++ ) // loop over the columns
                {
                repmat_bi.insert( bi , 0 , jj_nest_loop ) ; 
                }  
            bi_grad = ( data_0_transpose - repmat_bi - bi_star ).sum ( 1 ) ;
            bj_grad = ( hid_states.sum ( 0 ) ).transpose () ;
    
// Octave code ---> for hh = 1 : nt      
//                  A_grad( : , : , hh ) = ( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
//                  B_grad( : , : , hh ) = hid_states' * data( : , : , hh + 1 ) ;      
//                  end
            for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
                {
                p( 2 ) = hh ;                                                                    // set Array  p to write to page hh of A_grad and B_grad NDArrays
                data_extract = data.page ( hh + 1 ) ;                                            // get the equivalent of data( : , : , hh + 1 )
                A_grad.insert( ( data_0_transpose - repmat_bi - bi_star ) * data_extract , p ) ;
                B_grad.insert( hid_states.transpose () * data_extract , p ) ;
                }
// the above code comes from http://stackoverflow.com/questions/29572075/how-do-you-create-an-arrayoctave-idx-type-in-an-oct-file
// e.g.
//
// Array p (dim_vector (3, 1));
// int n = 2;
// dim_vector dim(n, n, 3);
// NDArray a_matrix(dim);
// 
// for (octave_idx_type i = 0; i < n; i++)
//   for (octave_idx_type j = 0; j < n; j++)
//     a_matrix(i,j, 1) = (i + 1) * 10 + (j + 1);
// 
// std::cout << a_matrix;
// 
// Matrix b_matrix = Matrix (n, n);
// b_matrix(0, 0) = 1; 
// b_matrix(0, 1) = 2; 
// b_matrix(1, 0) = 3; 
// b_matrix(1, 1) = 4; 
// std::cout << b_matrix;
// 
// Array p (dim_vector (3, 1), 0);
// p(2) = 2;
// a_matrix.insert (b_matrix, p);
// 
// std::cout << a_matrix;

// %%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  
 
// Activate the visible units
// Find the mean of the Gaussian
// Octave code ---> topdown = gsd .* ( hid_states * w ) ;
   topdown = hid_states * w ;
   
// This is the mean of the Gaussian. Instead of properly sampling, negdata is just the mean
// If we want to sample from the Gaussian, we would add in gsd .* randn( num_cases , num_dims ) ;
// Octave code ---> negdata = topdown + ...                       % top down connections
//                            repmat( bi' , num_cases , 1 ) + ... % static biases
//                            bi_star' ;                          % dynamic biases
   
// get repmat( bi' , num_cases , 1 ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
   bi_transpose = bi.transpose () ;
   for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_cases ; ii_nest_loop++ ) // loop over the rows
       {
       repmat_bi_transpose.insert( bi_transpose , ii_nest_loop , 0 ) ;
       }

   negdata = topdown + repmat_bi_transpose + bi_star.transpose () ;

// Now conditional on negdata, calculate "posterior" probability for hiddens
// Octave code ---> eta = w * ( negdata ./ gsd )' + ...      % bottom-up connections
//                        repmat( bj , 1 , num_cases ) + ... % static biases on unit (no change)
//                        bj_star ;                          % dynamic biases (no change)

   negdata_transpose = negdata.transpose () ;             // to save repetition of transpose
   eta = w * negdata_transpose + repmat_bj + bj_star ;
   
// h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ;    % logistic
// negate and exponentiate the eta matrix
   for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
       {
       for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
           {
           eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
           }
       }

// element division  A./B == quotient(A,B) 
   h_posteriors = quotient( ones , ( ones + eta ) ) ;

// Calculate negative gradients
// Octave code ---> neg_w_grad = h_posteriors * ( negdata ./ gsd ) ; % not using activations
   neg_w_grad = h_posteriors * negdata ; // not using activations
   
// Octave code ---> neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
   neg_bi_grad = ( negdata_transpose - repmat_bi - bi_star ).sum ( 1 ) ;
   
// Octave code ---> neg_bj_grad = sum( h_posteriors , 2 ) ;
   neg_bj_grad = h_posteriors.sum ( 1 ) ;
   
// Octave code ---> for hh = 1 : nt
//                      neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
//                      neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;        
//                  end
   
   for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
       {
       p( 2 ) = hh ;                                                                            // set Array  p to write to page hh of A_grad and B_grad NDArrays
       
       data_extract = data.page ( hh + 1 ) ;                                                    // get the equivalent of data( : , : , hh + 1 )
       neg_A_grad.insert( ( negdata_transpose - repmat_bi - bi_star ) * data_extract , p ) ;
       neg_B_grad.insert( h_posteriors * data_extract , p ) ;
   
       } // end of hh loop   

// %%%%%%%%% END NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

// Octave code ---> err = sum( sum( ( data( : , : , 1 ) - negdata ) .^2 ) ) ;
//                  errsum = err + errsum ;
// ( the err / errsum reconstruction error tracking is not used in this port )

// Octave code ---> if ( epoch > 5 ) % use momentum
//                  momentum = mom ;
//                  else % no momentum
//                  momentum = 0 ;
//                  end

   // the momentum matrices were initialised to 0.0; on the 6th epoch ( epoch == 5 ,
   // zero-based ) fill them with 0.9 , replicating the Octave "if ( epoch > 5 )" switch
   if ( epoch == 5 ) // will only be true once, after which momentum will == 0.9
      {
        momentum_w.fill( 0.9 ) ;
        momentum_bi.fill( 0.9 ) ;
        momentum_bj.fill( 0.9 ) ;
        momentum_A.fill( 0.9 ) ;
        momentum_B.fill( 0.9 ) ;
      }
     
// %%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%        
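
   // NB: momentum_w , epsilon_w , w_decay , num_cases_matrices_w_and_B and the other
   // similarly named variables are ( presumably, from their declarations earlier in
   // this file ) constant matrices pre-filled with the corresponding scalar values,
   // so the scalar arithmetic of the Octave code is reproduced element-wise with the
   // product() and quotient() functions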

// Octave code ---> w_update = momentum * w_update + epsilon_w * ( ( w_grad - neg_w_grad ) / num_cases - w_decay * w ) ;
   w_update = product( momentum_w , w_update ) + product( epsilon_w , quotient( ( w_grad - neg_w_grad ) , num_cases_matrices_w_and_B ) - product( w_decay , w ) ) ;
   
// Octave code ---> bi_update = momentum * bi_update + ( epsilon_bi / num_cases ) * ( bi_grad - neg_bi_grad ) ;
   bi_update = product( momentum_bi , bi_update ) + product( quotient( epsilon_bi , num_cases_matrix_bi ) , ( bi_grad - neg_bi_grad ) ) ;
   
// Octave code ---> bj_update = momentum * bj_update + ( epsilon_bj / num_cases ) * ( bj_grad - neg_bj_grad ) ;
   bj_update = product( momentum_bj , bj_update ) + product( quotient( epsilon_bj , num_cases_matrix_bj ) , ( bj_grad - neg_bj_grad ) ) ;
 
// The following two Octave code loops are combined into the single .oct loop that follows them
//   
// Octave code ---> for hh = 1 : nt
//                      A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;
//                      B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
//                  end
   
// Octave code ---> for hh = 1 : nt
//                      A( : , : , hh ) = A( : , : , hh ) + A_update( : , : , hh ) ;
//                      B( : , : , hh ) = B( : , : , hh ) + B_update( : , : , hh ) ;
//                  end   

   for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
       {
       p( 2 ) = hh ;
   
       A_update_extract = A_update.page ( hh ) ;
       A_grad_extract = A_grad.page ( hh ) ;
       neg_A_grad_extract = neg_A_grad.page ( hh ) ;
       A_extract = A.page ( hh ) ;
       A_update.insert( product( momentum_A , A_update_extract ) + product( epsilon_A , ( quotient( ( A_grad_extract - neg_A_grad_extract ) , num_cases_matrix_A ) - product( w_decay_A , A_extract ) ) ) , p ) ;
       A_update_extract = A_update.page ( hh ) ;
       A.insert( A_extract + A_update_extract , p ) ;
   
       B_update_extract = B_update.page ( hh ) ;
       B_grad_extract = B_grad.page ( hh ) ;
       neg_B_grad_extract = neg_B_grad.page ( hh ) ;
       B_extract = B.page ( hh ) ;
       B_update.insert( product( momentum_B , B_update_extract ) + product( epsilon_B , ( quotient( ( B_grad_extract - neg_B_grad_extract ) , num_cases_matrices_w_and_B ) - product( w_decay_B , B_extract ) ) ) , p ) ;
       B_update_extract = B_update.page ( hh ) ;
       B.insert( B_extract + B_update_extract , p ) ;
       
       } // end of hh loop
  
// Octave code ---> w = w + w_update ;
//                  bi = bi + bi_update ;
//                  bj = bj + bj_update ;

   w += w_update ;
   bi += bi_update ;
   bj += bj_update ;
       
// %%%%%%%%%%%%%%%% END OF UPDATES  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

            } // end of batch loop

    } // end of main epoch loop

// return the values w , bj , bi , A , B
retval_list(4) = B ;
retval_list(3) = A ;
retval_list(2) = bi ;    
retval_list(1) = bj ;
retval_list(0) = w ;

return retval_list ;
  
} // end of function
The code is heavily commented throughout. The .oct code for the binary_crbm.m function will follow in due course.
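
Once compiled, the .oct function above is intended to be a drop-in replacement for gaussian_crbm.m. A minimal sketch of compilation and use, assuming the source above is saved as gaussian_crbm.cc and that the argument list is kept the same as the original .m function:-
mkoctfile gaussian_crbm.cc   % compile, from the Octave prompt
[ w1 , bj1 , bi1 , A1 , B1 ] = gaussian_crbm( batchdata , minibatch , nt , gsd , numepochs , numhid , restart ) ;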

Tuesday, 31 March 2015

Conditional Restricted Boltzmann Machine

I have recently been looking at using a Conditional Restricted Boltzmann Machine, and in particular Graham Taylor's thesis, Composable Distributed-State Models for High-Dimensional Time Series, and the associated code available from here. My adaptation of this is very much a work in progress, but I think it holds great promise.

Instead of using the "label" of my MFE/MAE indicator as determined by the matches found by my Cauchy Schwarz matching algorithm, I'm thinking of using the above CRBM to directly model the price action that immediately follows said matches, using the generative ability of the CRBM to create sets of modelled price bars and then applying the MFE/MAE indicator to these modelled bars. In this way I should be able to generate an empirical distribution of future MFE/MAE values as input for making trading decisions.
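
As a rough sketch of this intended Monte Carlo style use ( the functions generate_crbm_bars and mfe_mae below are hypothetical placeholders for the generative code and the indicator code respectively ), something like:-
%  sketch only: generate_crbm_bars() and mfe_mae() are hypothetical wrappers
num_samples = 1000 ;
mfe_values = zeros( num_samples , 1 ) ;
mae_values = zeros( num_samples , 1 ) ;

for ii = 1 : num_samples
  %  sample one set of modelled future price bars from the trained CRBM
  [ model_highs , model_lows , model_closes ] = generate_crbm_bars( init_data ) ;
  %  apply the MFE/MAE indicator to the modelled bars
  [ mfe_values( ii ) , mae_values( ii ) ] = mfe_mae( model_highs , model_lows , model_closes ) ;
end

%  empirical distributions of future MFE/MAE values
figure(1) ; hist( mfe_values , 50 ) ;
figure(2) ; hist( mae_values , 50 ) ;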

In my adaptation of the above linked code I have replaced the somewhat complicated motion capture input data with a very simple set of 5 features of the log changes in price between consecutive bars. I have also made the code easier to read by spacing it out, using underscores to separate the words in compound variable names, adding some extra comments and renaming some variables to be more relevant to the intended use, and I have split the code into separately callable functions with an eye on future coding into .oct C++ functions. However, in essence, the code is much the same as is downloadable from the above code link. The code blocks below show my adaptations so far:-
first, the basic training file
clear all

% load price file of interest
filename = input( 'Enter filename for prices, e.g. es or esmatrix: ' , 's' ) ;
data = load( "-ascii" , filename ) ;

% get tick size
switch filename

case { "cc" }
tick = 1 ;

case { "gc" "lb" "pl" "sm" "sp" }
tick = 0.1 ;

case { "ausyen" "bo" "cl" "ct" "dx" "euryen" "gbpyen" "sb" "usdyen" }
tick = 0.01 ;

case { "ng" }
tick = 0.001 ;

case { "auscad" "aususd" "euraus" "eurcad" "eurchf" "eurgbp" "eurusd" "gbpchf" "gbpusd" "ho" "rb" "usdcad" "usdchf" }
tick = 0.0001 ;

case { "c" "o" "s" "es" "nd" "w" }
tick = 0.25 ;

case { "fc" "lc" "lh" "pb" }
tick = 0.025 ;

case { "ed" }
tick = 0.0025 ;

case { "si" }
tick = 0.5 ;

case { "hg" "kc" "oj" "pa" }
tick = 0.05 ;

case { "ty" "us" }
tick = 0.015625 ;

case { "ccmatrix" }
tick = 1 ;

case { "gcmatrix" "lbmatrix" "plmatrix" "smmatrix" "spmatrix" }
tick = 0.1 ;

case { "ausyenmatrix" "bomatrix" "clmatrix" "ctmatrix" "dxmatrix" "euryenmatrix" "gbpyenmatrix" "sbmatrix" "usdyenmatrix" }
tick = 0.01 ;

case { "ngmatrix" }
tick = 0.001 ;

case { "auscadmatrix" "aususdmatrix" "eurausmatrix" "eurcadmatrix" "eurchfmatrix" "eurgbpmatrix" "eurusdmatrix" "gbpchfmatrix" "gbpusdmatrix" "homatrix" "rbmatrix" "usdcadmatrix" "usdchfmatrix" }
tick = 0.0001 ;

case { "cmatrix" "omatrix" "smatrix" "esmatrix" "ndmatrix" "wmatrix" }
tick = 0.25 ;

case { "fcmatrix" "lcmatrix" "lhmatrix" "pbmatrix" }
tick = 0.025 ;

case { "edmatrix" }
tick = 0.0025 ;

case { "simatrix" }
tick = 0.5 ;

case { "hgmatrix" "kcmatrix" "ojmatrix" "pamatrix" }
tick = 0.05 ;

case { "tymatrix" "usmatrix" }
tick = 0.015625 ;

endswitch

%  get data
open = data( : , 4 ) ;
high = data( : , 5 ) ;
low = data( : , 6 ) ;
close = data( : , 7 ) ;

clear data

%  create log series for crbm training
vwap = vwap_all_period_detrend( open , high , low , close , tick ) ;
open_from_vwap = log( open ./ shift( vwap , 1 ) ) ; open_from_vwap(1) = 0 ;
open_from_previous_close = log( open ./ shift( close , 1 ) ) ; open_from_previous_close(1) = 0 ;
high_from_vwap = log( high ./ vwap ) ; high_from_vwap(1) = 0 ;
low_from_vwap = log( low ./ vwap ) ; low_from_vwap(1) = 0 ;
close_from_vwap = log( close ./ vwap ) ; close_from_vwap(1) = 0 ;

%  assemble into a features matrix
features = [ open_from_vwap open_from_previous_close high_from_vwap low_from_vwap close_from_vwap ] ;

%  now prepare data for the NN training
%  visually inspect chart to choose a training target
pkg load financial
clf () ;
figure(1) ; highlow( high , low , close , open ) ;
training_point_index = input( 'Choose a training point for NN training by entering an index from figure 1: ' ) ;
bar_number = 0 ;

%  load the pre-computed top 500 Cauchy Schwarz Cross validation index matches
load eurusd_top_500_cv_matches

%  select how many of the above matches to use - an optimisable parameter via pso?
number_of_training_examples_to_use = 450 ;

for ii = 0 : 2
%  bar_number = input( 'Enter a 1, 2 or 3 to identify the bar sequence order' ) ;
bar_number = bar_number + 1 ;

% using training_point_index get the top number_of_training_examples_to_use matches
features_index = eurusd_top_500_cv_matches( training_point_index+ii , 1 : number_of_training_examples_to_use )' ;

%  use the above features_index to create batchdata for training, assembled into form suitable for crbm training
%  how-many timesteps do we look back for directed connections - this is what we call the "order" of the model 
n1 = 3 ; % first layer
n2 = 3 ; % second layer

%  create batchdataindex to index bars that include bars at timesteps t-2, t-1, t and t+1 where t is the bar at features_index
batchdataindex = [ features_index ; features_index .- 2 ; features_index .- 1 ; features_index .+ 1 ] ;
%  sorted_batchdata_index = sort( batchdataindex ) ;

%  shuffle training set by shuffling the index values
batchdataindex = batchdataindex( randperm( size( batchdataindex , 1 ) ) , : ) ;

%  normalise the features matrix by the mean and std of just the batchdataindex values in features matrix
data_mean = mean( features( batchdataindex , : ) , 1 ) ;
data_std = std( features( batchdataindex , : ) ) ;
batchdata = ( features - repmat( data_mean , size( features , 1 ) , 1 ) ) ./ repmat( data_std , size( features , 1 ) , 1 ) ;

if ( bar_number == 1 )
data_mean_1 = data_mean ; data_std_1 = data_std ;
save bar_1_mean_std.mat data_mean_1 data_std_1 ;
elseif ( bar_number == 2 )
data_mean_2 = data_mean ; data_std_2 = data_std ;
save bar_2_mean_std.mat data_mean_2 data_std_2 ;
elseif ( bar_number == 3 )
data_mean_3 = data_mean ; data_std_3 = data_std ;
save bar_3_mean_std.mat data_mean_3 data_std_3 ;
else
fprintf( 'Have entered wrong value for bar number.\n' ) ;
end

%  now create minibatches from the batchdataindex in the form of a cell array
batchsize = 100 ;
minibatchindex = reshape( batchdataindex , size( batchdataindex , 1 ) / batchsize , batchsize ) ;
minibatch = num2cell( minibatchindex , 2 ) ;

%  Set network properties
numdims = size( features , 2 ) ;
%  numhid1 = 150 ; numhid2 = 150 ; numepochs = 2000 ; % original
numhid1 = 100 ; numhid2 = 100 ; numepochs = 2000 ;
gsd = 1 ;          % fixed standard deviation for Gaussian units
nt = n1 ;          % crbm "order"
numhid = numhid1 ;
%  fprintf( 1 , 'Training Layer 1 CRBM, order %d: %d-%d \n', nt , numdims , numhid ) ;
restart = 1 ;      % initialize weights
tic() ;
[ w1 , bj1 , bi1 , A1 , B1 ] = gaussian_crbm( batchdata , minibatch , nt , gsd , numepochs , numhid , restart ) ;
gaussian_crbm_training_time = toc()

if ( bar_number == 1 )
w11 = w1 ; bj11 = bj1 ; bi11 = bi1 ; A11 = A1 ; B11 = B1 ;
save layer1.mat w11 bj11 bi11 A11 B11 ;
elseif ( bar_number == 2 )
w12 = w1 ; bj12 = bj1 ; bi12 = bi1 ; A12 = A1 ; B12 = B1 ;
save layer1_2.mat w12 bj12 bi12 A12 B12 ;
elseif ( bar_number == 3 )
w13 = w1 ; bj13 = bj1 ; bi13 = bi1 ; A13 = A1 ; B13 = B1 ;
save layer1_3.mat w13 bj13 bi13 A13 B13 ;
else
fprintf( 'Have entered wrong value for bar number.\n' ) ;
end

tic() ;
[ filtering_distance ] = get_filtering_distance( batchdata , unique( batchdataindex ) , numhid , w1 , bj1 , n1 , n2 , B1 , gsd ) ;
get_filtering_distance_timing = toc()

%  now use filtering_distance as new features for training of second layer
%  first create a new batchdata set
batchdata_2 = zeros( size( features , 1 ) , size( filtering_distance , 2 ) ) ;
%  and fill with the relevant, indexed values from filtering_distance
batchdata_2( unique( batchdataindex ) , : ) = filtering_distance ;

%  create new minibatch to index batchdata_2
batchsize = 25 ;
minibatchindex = reshape( features_index .+ 1 , size( features_index , 1 ) / batchsize , batchsize ) ;
minibatch = num2cell( minibatchindex , 2 ) ;

%  set network properties for the 2nd layer training
numhid = numhid2 ; nt = n2 ;
numdims = size( batchdata_2 , 2 ) ; % data (visible) dimension
%  fprintf( 1 , 'Training Layer 2 CRBM, order %d: %d-%d \n' , nt , numdims , numhid ) ;
restart = 1 ; % initialize weights
tic() ;
[ w2 , bj2 , bi2 , A2 , B2 ] = binary_crbm( batchdata_2 , minibatch , nt , numepochs , numhid , restart ) ;
binary_crbm_training_time = toc()

if ( bar_number == 1 )
w21 = w2 ; bj21 = bj2 ; bi21 = bi2 ; A21 = A2 ; B21 = B2 ;
save layer2.mat w21 bj21 bi21 A21 B21 ;
elseif ( bar_number == 2 )
w22 = w2 ; bj22 = bj2 ; bi22 = bi2 ; A22 = A2 ; B22 = B2 ;
save layer2_2.mat w22 bj22 bi22 A22 B22 ;
elseif ( bar_number == 3 )
w23 = w2 ; bj23 = bj2 ; bi23 = bi2 ; A23 = A2 ; B23 = B2 ;
save layer2_3.mat w23 bj23 bi23 A23 B23 ;
else
fprintf( 'Have entered wrong value for bar number.\n' ) ;
end

end % end of training_point_index loop
which in turn calls gaussian_crbm.m
function [ w , bj , bi , A , B ] = gaussian_crbm( batchdata , minibatch , nt , gsd , num_epochs , num_hid , restart )

%  Version 1.01
%
%  Code provided by Graham Taylor, Geoff Hinton and Sam Roweis
%
%  For more information, see:
%     http://www.cs.toronto.edu/~gwtaylor/publications/nips2006mhmublv
%
%  Permission is granted for anyone to copy, use, modify, or distribute this
%  program and accompanying programs and documents for any purpose, provided
%  this copyright notice is retained and prominently displayed, along with
%  a note saying that the original programs are available from our
%  web page.
%  The programs and documents are distributed without any warranty, express or
%  implied.  As the programs were written for research purposes only, they have
%  not been tested to the degree that would be advisable in any important
%  application.  All use of these programs is entirely at the user's own risk.
%  
%  This program trains a Conditional Restricted Boltzmann Machine in which
%  visible, Gaussian-distributed inputs are connected to
%  hidden, binary, stochastic feature detectors using symmetrically
%  weighted connections. Learning is done with 1-step Contrastive Divergence.
%  Directed connections are present, from the past nt configurations of the
%  visible units to the current visible units (A), and the past nt
%  configurations of the visible units to the current hidden units (B)
%  
%  The program assumes that the following variables are set externally:
%  nt        -- order of the model
%  gsd       -- fixed standard deviation of Gaussian visible units
%  num_epochs -- maximum number of epochs
%  num_hid    --  number of hidden units 
%  batchdata --  a matrix of data (num_cases,num_dims) 
%  minibatch -- a cell array of dimension batchsize, indexing the valid
%  frames in batchdata
%  restart   -- set to 1 if learning starts from beginning 

%  batchdata is a big matrix of all the frames
%  we index it with "minibatch", a cell array of mini-batch indices
num_batches = length( minibatch ) ; 

num_dims = size( batchdata , 2 ) ; % visible dimension

%  Setting learning rates
epsilon_w = 1e-3 ; % undirected
epsilon_bi = 1e-3 ; % visibles
epsilon_bj = 1e-3 ; % hidden units
epsilon_A = 1e-3 ; % autoregressive
epsilon_B = 1e-3 ; % prev visibles to hidden

w_decay = 0.0002 ; % currently we use the same weight decay for w, A, B
mom = 0.9 ;       % momentum used only after 5 epochs of training

if ( restart == 1 )  
    restart = 0 ;
    epoch = 1 ;

    % Randomly initialize weights
    w = 0.01 * randn( num_hid , num_dims ) ;
    bi = 0.01 * randn( num_dims , 1 ) ;
    bj = -1 + 0.01 * randn( num_hid , 1 ) ; % set to favour units being "off"

    % The autoregressive weights; A(:,:,j) is the weight from t-j to the vis
    A = 0.01 * randn( num_dims ,num_dims ,nt ) ;

    % The weights from previous time-steps to the hiddens; B(:,:,j) is the
    % weight from t-j to the hidden layer
    B = 0.01 * randn( num_hid , num_dims , nt ) ;

    clear w_grad bi_grad bj_grad A_grad B_grad
    clear neg_w_grad neg_bi_grad neg_bj_grad neg_A_grad neg_B_grad

    % keep previous updates around for momentum
    w_update = zeros( size( w ) ) ;
    bi_update = zeros( size( bi ) ) ;
    bj_update = zeros( size( bj ) ) ;
    A_update = zeros( size( A ) ) ;
    B_update = zeros( size( B ) ) ;
    
end % end of restart if

% Main loop
for epoch = epoch : num_epochs

  errsum = 0 ; % keep a running total of the difference between data and recon
  
  for batch = 1 : num_batches     
   
%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%      
    num_cases = length( minibatch{ batch } ) ;
    mb = minibatch{ batch } ; % caches the indices   

    % data is a nt+1-d array with current and delayed data
    % corresponding to this mini-batch
    data = zeros( num_cases , num_dims , nt + 1 ) ;
    data( : , : , 1 ) = batchdata( mb , : ) ;
    for hh = 1 : nt
        data( : , : , hh + 1 ) = batchdata( mb - hh , : ) ;
    end
     
    % calculate contributions from directed autoregressive connections
    bi_star = zeros( num_dims , num_cases ) ; 
    for hh = 1 : nt       
        bi_star = bi_star +  A( : , : , hh ) * data( : , : , hh + 1 )' ;
    end   
    
    % Calculate contributions from directed visible-to-hidden connections
    bj_star = zeros( num_hid , num_cases ) ;
    for hh = 1:nt
        bj_star = bj_star + B( : , : , hh ) * data( : , : , hh + 1 )' ;
    end
      
    % Calculate "posterior" probability -- hidden state being on 
    % Note that it isn't a true posterior
    eta = w * ( data( : , : , 1 ) ./ gsd )' + ... % bottom-up connections
          repmat( bj , 1 , num_cases ) + ...      % static biases on unit
          bj_star ;                               % dynamic biases
    
    h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ;    % logistic
    
    % Activate the hidden units    
    hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ; 
    
    % Calculate positive gradients (note w.r.t. neg energy)
    w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
    bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
    bj_grad = sum( hid_states , 1 )' ;
           
    for hh = 1 : nt      
        A_grad( : , : , hh ) = ( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
        B_grad( : , : , hh ) = hid_states' * data( : , : , hh + 1 ) ;      
    end
    
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
    % Activate the visible units
    % Find the mean of the Gaussian
    topdown = gsd .* ( hid_states * w ) ;

    % This is the mean of the Gaussian 
    % Instead of properly sampling, negdata is just the mean
    % If we want to sample from the Gaussian, we would add in
    % gsd .* randn( num_cases , num_dims ) ;
    negdata = topdown + ...                       % top down connections
              repmat( bi' , num_cases , 1 ) + ... % static biases
              bi_star' ;                          % dynamic biases

    % Now conditional on negdata, calculate "posterior" probability
    % for hiddens
    eta = w * ( negdata ./ gsd )' + ...      % bottom-up connections
          repmat( bj , 1 , num_cases ) + ... % static biases on unit (no change)
          bj_star ;                          % dynamic biases (no change)

    h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
    
    % Calculate negative gradients
    neg_w_grad = h_posteriors * ( negdata ./ gsd ) ; % not using activations
    neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
    neg_bj_grad = sum( h_posteriors , 2 ) ;
    
    for hh = 1 : nt
        neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
        neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;        
    end

%%%%%%%%% END NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%    
    
    err = sum( sum( ( data( : , : , 1 ) - negdata ) .^2 ) ) ;
    errsum = err + errsum ;
    
    if ( epoch > 5 ) % use momentum
        momentum = mom ;
    else % no momentum
        momentum = 0 ;
    end
    
%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%        
    
    w_update = momentum * w_update + epsilon_w * ( ( w_grad - neg_w_grad ) / num_cases - w_decay * w ) ;
    bi_update = momentum * bi_update + ( epsilon_bi / num_cases ) * ( bi_grad - neg_bi_grad ) ;
    bj_update = momentum * bj_update + ( epsilon_bj / num_cases ) * ( bj_grad - neg_bj_grad ) ;

    for hh = 1 : nt
        A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;
        B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
    end

    w = w + w_update ;
    bi = bi + bi_update ;
    bj = bj + bj_update ;

    for hh = 1 : nt
        A( : , : , hh ) = A( : , : , hh ) + A_update( : , : , hh ) ;
        B( : , : , hh ) = B( : , : , hh ) + B_update( : , : , hh ) ;
    end
    
%%%%%%%%%%%%%%%% END OF UPDATES  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
  
  end
    
%    % every 10 epochs, show output
%    if ( mod( epoch , 10 ) == 0 )
%       fprintf( 1 , 'epoch %4i error %6.1f  \n', epoch , errsum ) ; 
%       % Could see a plot of the weights every 10 epochs
%       % figure(3); weightreport
%       % drawnow;
%    end
    
end % end of epoch for loop
    
end % end of function
get_filtering_distance.m
function [ filtering_distance ] = get_filtering_distance( batchdata , batchdataindex , num_hid , w , bj , n1 , n2 , B , gsd )

% Version 1.01 
%
% Code provided by Graham Taylor, Geoff Hinton and Sam Roweis 
%
% For more information, see:
%     http://www.cs.toronto.edu/~gwtaylor/publications/nips2006mhmublv
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.

% This program runs the entire training set on the trained 1-hidden-layer
% network and saves the filtering distribution vectors in mini-batch format
% This is done before training a second CRBM on top of the first

% The program assumes that the following variables are set externally:
% n2    -- order of the next layer CRBM

%  batchsize = 100 ; % size of mini-batches

%  take all valid examples ( indexed by batchdataindex )
num_cases = length( batchdataindex ) ;

%  Calculate contributions from directed visible-to-hidden connections
bj_star = zeros( num_hid , num_cases ) ;

for hh = 1 : n1

  bj_star = bj_star + B( : , : , hh ) * batchdata( batchdataindex - hh , : )' ;
  
end % end for loop

%  Calculate "posterior" probability -- hidden state being on
%  Note that it isn't a true posterior
bottom_up = w * ( batchdata( batchdataindex , : ) ./ gsd )' ;

eta = bottom_up + ...                    % bottom-up connections
      repmat( bj , 1 , num_cases ) + ... % static biases on unit
      bj_star ;                          % dynamic biases

filtering_distance = 1 ./ ( 1 + exp( -eta' ) ) ; % logistic

end % end of function
binary_crbm.m
function [ w , bj , bi , A , B ] = binary_crbm( input_data , minibatch , nt , num_epochs , num_hid , restart )

% Version 1.01 
%
% Code provided by Graham Taylor, Geoff Hinton and Sam Roweis 
%
% For more information, see:
%     http://www.cs.toronto.edu/~gwtaylor/publications/nips2006mhmublv
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.

% This program trains a Conditional Restricted Boltzmann Machine in which
% visible, binary, stochastic inputs are connected to
% hidden, binary, stochastic feature detectors using symmetrically
% weighted connections. Learning is done with 1-step Contrastive Divergence.
% Directed connections are present, from the past nt configurations of the
% visible units to the current visible units (A), and the past nt
% configurations of the visible units to the current hidden units (B)

% The program assumes that the following variables are set externally:
% nt        -- order of the model
% num_epochs -- maximum number of epochs
% num_hid    --  number of hidden units 
% input_data --  a matrix of data (num_cases,num_dims) 
% minibatch -- a cell array of dimension batchsize, indexing the valid
% frames in input_data
% restart   -- set to 1 if learning starts from beginning 

% input_data is a big matrix of all the frames
% we index it with "minibatch", a cell array of mini-batch indices
num_batches = length( minibatch ) ; 

num_dims = size( input_data , 2 ) ; % visible dimension

%  Setting learning rates
epsilon_w = 1e-3 ; % undirected
epsilon_bi = 1e-3 ; % visibles
epsilon_bj = 1e-3 ; % hidden units
epsilon_A = 1e-3 ; % autoregressive
epsilon_B = 1e-3 ; % prev visibles to hidden

w_decay = 0.0002 ; % currently we use the same weight decay for w, A, B
mom = 0.9 ;        % momentum used only after 5 epochs of training

if ( restart == 1 )  
   restart = 0 ;
   epoch = 1 ;
  
  % Randomly initialize weights
  w = 0.01 * randn( num_hid , num_dims ) ;
  bi = 0.01 * randn( num_dims , 1 ) ;
  bj = -1 + 0.01 * randn( num_hid , 1 ) ; % set to favour units being "off"
  
  % The autoregressive weights; A(:,:,j) is the weight from t-j to the vis
  A = 0.01 * randn( num_dims , num_dims , nt ) ;
 
  % The weights from previous time-steps to the hiddens; B(:,:,j) is the
  % weight from t-j to the hidden layer
  B = 0.01 * randn( num_hid , num_dims , nt ) ;
  
  clear w_grad bi_grad bj_grad A_grad B_grad
  clear neg_w_grad neg_bi_grad neg_bj_grad neg_A_grad neg_B_grad
  
  % keep previous updates around for momentum
  w_update = zeros( size( w ) ) ;
  bi_update = zeros( size( bi ) ) ;
  bj_update = zeros( size( bj ) ) ;
  A_update = zeros( size( A ) ) ;
  B_update = zeros( size( B ) ) ;
    
end % end of restart if

% Main loop
for epoch = epoch : num_epochs

  errsum = 0 ; % keep a running total of the difference between data and recon
  
  for batch = 1 : num_batches     

%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
    num_cases = length( minibatch{ batch } ) ;
    mb = minibatch{ batch } ; % caches the indices   

    % data is a nt+1-d array with current and delayed data
    % corresponding to this mini-batch
    data = zeros( num_cases , num_dims , nt + 1 ) ;
    data( : , : , 1 ) = input_data( mb , : ) ;
    for hh = 1 : nt
        data( : , : , hh + 1 ) = input_data( mb - hh , : ) ;
    end

    % Calculate contributions from directed autoregressive connections
    bi_star = zeros( num_dims , num_cases ) ; 
    for hh = 1 : nt       
        bi_star = bi_star +  A( : , : , hh ) * data( : , : , hh + 1 )' ;
    end   
    
    % Calculate contributions from directed visible-to-hidden connections
    bj_star = zeros( num_hid , num_cases ) ;
    for hh = 1 : nt
        bj_star = bj_star + B( : , : , hh ) * data( : , : , hh + 1 )' ;
    end
     
    bottom_up = w * data( : , : , 1 )' ;
    
    % Calculate "posterior" probability -- hidden state being on 
    % Note that it isn't a true posterior   
    eta = bottom_up + ...                    % bottom-up connections
          repmat( bj , 1 , num_cases ) + ... % static biases on unit
          bj_star ;                          % dynamic biases
    
    h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
       
    % Activate the hidden units    
    hid_states = h_posteriors' > rand( num_cases , num_hid ) ; 
    
    % Calculate positive gradients (note w.r.t. neg energy)
    w_grad = hid_states' * data( : , : , 1 ) ;
    bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ;
    bj_grad = sum( hid_states , 1 )' ;
           
    for hh = 1 : nt      
        A_grad( : , : , hh ) = ( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star ) * data( : , : , hh + 1 ) ;
        B_grad( : , : , hh ) = hid_states' * data( : , : , hh + 1 ) ;      
    end
    
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%% START NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%    
    
    % Activate the visible units
    topdown = hid_states * w ;

    eta = topdown + ...                       % top down connections
          repmat( bi' , num_cases , 1 ) + ... % static biases
          bi_star' ;                          % dynamic biases
    
    negdata = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
    
    % Now conditional on negdata, calculate "posterior" probability
    % for hiddens
    bottom_up = w * negdata' ;
    eta = bottom_up + ...                    % bottom-up connections
          repmat( bj , 1 , num_cases ) + ... % static biases on unit (no change)
          bj_star ;                          % dynamic biases (no change)

    h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
    
    % Calculate negative gradients
    neg_w_grad = h_posteriors * negdata ; % not using activations
    neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ;
    neg_bj_grad = sum( h_posteriors , 2 ) ;
    
    for hh = 1 : nt
        neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) * data( : , : , hh + 1 ) ;
        neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;        
    end
   
%%%%%%%%% END NEGATIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
    err = sum( sum( ( data( : , : , 1 ) - negdata ) .^2 ) ) ;
    errsum = err + errsum ;
    
    if ( epoch > 5 ) % use momentum
        momentum = mom ;
    else % no momentum
        momentum = 0 ;
    end
    
%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
    w_update = momentum * w_update + epsilon_w * ( ( w_grad - neg_w_grad ) / num_cases - w_decay * w ) ;
    bi_update = momentum * bi_update + ( epsilon_bi / num_cases ) * ( bi_grad - neg_bi_grad ) ;
    bj_update = momentum * bj_update + ( epsilon_bj / num_cases ) * ( bj_grad - neg_bj_grad ) ;

    for hh = 1 : nt
        A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;
        B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
    end

    w = w +  w_update ;
    bi = bi + bi_update ;
    bj = bj + bj_update ;

    for hh = 1 : nt
        A( : , : , hh ) = A( : , : , hh ) + A_update( : , : , hh ) ;
        B( : , : , hh ) = B( : , : , hh ) + B_update( : , : , hh ) ;
    end
    
%%%%%%%%%%%%%%%% END OF UPDATES  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
  end
    
%    % every 10 epochs, show output
%    if ( mod( epoch , 10 ) == 0 )
%        fprintf( 1 , 'epoch %4i error %6.1f  \n' , epoch , errsum ) ; 
%        % Could see a plot of the weights every 10 epochs
%        % figure(3); weightreport
%        % drawnow;
%    end
    
end % end main loop
    
end % end of function
generates new data with generate_visible_data.m
function [ visible_data  ] = generate_visible_data( init_data , hidden_1 , num_hid_1 , num_hid_2 , w1 , w2 , bj1 , bi1 , bj2 , bi2 , A1 , A2 , B1 , B2 , n1 , n2 , gsd )

% Version 1.01 
%
% Code provided by Graham Taylor, Geoff Hinton and Sam Roweis 
%
% For more information, see:
%     http://www.cs.toronto.edu/~gwtaylor/publications/nips2006mhmublv
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied.  As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application.  All use of these programs is entirely at the user's own risk.
%
% This program uses the 2-level CRBM to generate data
% More efficient version than the original

num_frames = 1 ;
max_clamped = n1 + n2 ;
A2_flat = reshape( A2 , num_hid_1 , n2 * num_hid_1 ) ;
B2_flat = reshape( B2 , num_hid_2 , n2 * num_hid_1 ) ;

num_Gibbs = 30 ; % number of alternating Gibbs iterations

%  initialize second layer ( first n1 + n2 frames padded )
hidden_2 = ones( num_frames , num_hid_2 ) ;

%  keep the recent past in vector form
%  for input to directed links
%  it's slightly faster to pre-compute this vector and update it ( by
%  shifting ) at each time frame, rather than to fully recompute each time
%  frame
past = reshape( hidden_1( max_clamped : -1 : max_clamped + 1 - n2 , : )' , num_hid_1 * n2 , 1 ) ;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  First generate a hidden sequence (top layer)
%  Then go down through the first CRBM
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

fprintf( 'Generating hidden states\n' ) ;

%  for tt = max_clamped + 1 : num_frames
  tt = size( hidden_1 , 1 ) ;

  % initialize using the last frame
  % noise is not added for binary units
  hidden_1( tt , : ) = hidden_1( tt - 1 , : ) ;
  
  % Dynamic biases aren't re-calculated during Alternating Gibbs
  bi_star = A2_flat * past ;
  bj_star = B2_flat * past ;

  % Gibbs sampling
  for gg = 1 : num_Gibbs
  
    % Calculate posterior probability -- hidden state being on (estimate)
    % add in bias
    bottomup = w2 * hidden_1( tt , : )' ;
    eta = bottomup + ... % bottom-up connections
          bj2 + ...      % static biases on unit
          bj_star ;      % dynamic biases
    
    hposteriors = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
    
    hidden_2( tt , : ) = double( hposteriors' > rand( 1 , num_hid_2 ) ) ; % Activating hiddens
    
    % Downward pass; visibles are binary logistic units     
    topdown = hidden_2( tt , : ) * w2 ;
        
    eta = topdown + ... % top down connections
          bi2' + ...    % static biases
          bi_star';     % dynamic biases   
    
    hidden_1( tt , : ) = 1 ./ ( 1 + exp( -eta ) ) ; % logistic 
    
  end % end gibbs sampling

  % If we are done Gibbs sampling, then do a mean-field sample
  
  topdown = hposteriors' * w2 ; % Very noisy if we don't do mean-field here

  eta = topdown + ... % top down connections
        bi2' + ...    % static biases
        bi_star' ;    % dynamic biases

  hidden_1( tt , : ) = 1 ./ ( 1 + exp( -eta ) ) ;

  % update the past
%    past( num_hid_1 + 1 : end ) = past( 1 : end - num_hid_1 ) ; % shift older history down
%    past( 1 : num_hid_1 ) = hidden_1( tt , : ) ;                % place most recent frame at top
  
  
%    if ( mod( tt , 10 ) == 0 )
%       fprintf( 'Finished frame %d\n' , tt ) ;
%    end
  
%  end % end tt loop

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  Now that we've decided on the "filtering distribution", generate visible
%  data through CRBM 1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

fprintf( 'Generating visible data\n' ) ;

%  for tt = max_clamped + 1 : num_frames

    % Add contributions from autoregressive connections
%      bi_star = zeros( num_dims , 1 ) ; % original
    bi_star = zeros( size( init_data , 2 ) , 1 ) ;
    for hh = 1 : n1
%          bi_star = bi_star + A1( : , : , hh ) * visible( tt - hh , : )' ; % original
        bi_star = bi_star + A1( : , : , hh ) * init_data( tt - hh , : )' ;
    end

    % Mean-field approx; visible units are Gaussian
    % ( filtering distribution is the data we just generated )
    topdown = gsd .* ( hidden_1( tt , : ) * w1 ) ;
%      visible( tt , : ) = topdown + ... % top down connections ( original )
    visible_data = topdown + ...
                   bi1' + ...    % static biases
                   bi_star' ;    % dynamic biases
                        
%  end % end tt loop

end % end of function
and then finally plots this new data as price bars, crbm_visual_test.m
clear all

% load price file of interest
filename = input( 'Enter filename for prices, e.g. es or esmatrix: ' , 's' ) ;
data = load( "-ascii" , filename ) ;

% get tick size
switch filename

case { "cc" }
tick = 1 ;

case { "gc" "lb" "pl" "sm" "sp" }
tick = 0.1 ;

case { "ausyen" "bo" "cl" "ct" "dx" "euryen" "gbpyen" "sb" "usdyen" }
tick = 0.01 ;

case { "ng" }
tick = 0.001 ;

case { "auscad" "aususd" "euraus" "eurcad" "eurchf" "eurgbp" "eurusd" "gbpchf" "gbpusd" "ho" "rb" "usdcad" "usdchf" }
tick = 0.0001 ;

case { "c" "o" "s" "es" "nd" "w" }
tick = 0.25 ;

case { "fc" "lc" "lh" "pb" }
tick = 0.025 ;

case { "ed" }
tick = 0.0025 ;

case { "si" }
tick = 0.5 ;

case { "hg" "kc" "oj" "pa" }
tick = 0.05 ;

case { "ty" "us" }
tick = 0.015625 ;

case { "ccmatrix" }
tick = 1 ;

case { "gcmatrix" "lbmatrix" "plmatrix" "smmatrix" "spmatrix" }
tick = 0.1 ;

case { "ausyenmatrix" "bomatrix" "clmatrix" "ctmatrix" "dxmatrix" "euryenmatrix" "gbpyenmatrix" "sbmatrix" "usdyenmatrix" }
tick = 0.01 ;

case { "ngmatrix" }
tick = 0.001 ;

case { "auscadmatrix" "aususdmatrix" "eurausmatrix" "eurcadmatrix" "eurchfmatrix" "eurgbpmatrix" "eurusdmatrix" "gbpchfmatrix" "gbpusdmatrix" "homatrix" "rbmatrix" "usdcadmatrix" "usdchfmatrix" }
tick = 0.0001 ;

case { "cmatrix" "omatrix" "smatrix" "esmatrix" "ndmatrix" "wmatrix" }
tick = 0.25 ;

case { "fcmatrix" "lcmatrix" "lhmatrix" "pbmatrix" }
tick = 0.025 ;

case { "edmatrix" }
tick = 0.0025 ;

case { "simatrix" }
tick = 0.5 ;

case { "hgmatrix" "kcmatrix" "ojmatrix" "pamatrix" }
tick = 0.05 ;

case { "tymatrix" "usmatrix" }
tick = 0.015625 ;

endswitch

%  get data
open = data( : , 4 ) ;
high = data( : , 5 ) ;
low = data( : , 6 ) ;
close = data( : , 7 ) ;

%  clear data

%  create log series for crbm training
vwap = vwap_all_period_detrend( open , high , low , close , tick ) ;
open_from_vwap = log( open ./ shift( vwap , 1 ) ) ; open_from_vwap(1) = 0 ; % must match the feature definition used in the training script
open_from_previous_close = log( open ./ shift( close , 1 ) ) ; open_from_previous_close(1) = 0 ;
high_from_vwap = log( high ./ vwap ) ; high_from_vwap(1) = 0 ;
low_from_vwap = log( low ./ vwap ) ; low_from_vwap(1) = 0 ;
close_from_vwap = log( close ./ vwap ) ; close_from_vwap(1) = 0 ;

%  assemble into a features matrix
features = [ open_from_vwap open_from_previous_close high_from_vwap low_from_vwap close_from_vwap ] ;

training_point_index = input( 'The training point index for NN training from figure 1: ' ) ;

%  load the weights etc.
load layer1.mat ;
load layer1_2.mat ;
load layer1_3.mat ;
load layer2.mat ;
load layer2_2.mat ;
load layer2_3.mat ;
load bar_1_mean_std ;
load bar_2_mean_std ;
load bar_3_mean_std ;

%  how-many timesteps do we look back for directed connections - this is what we call the "order" of the model 
n1 = 3 ; % first layer
n2 = 3 ; % second layer

%  Set network properties
numhid1 = 100 ; numhid2 = 100 ; % numepochs = 2000 ;
gsd = 1 ;          % fixed standard deviation for Gaussian units

%  the first bar
init_data = features( training_point_index - 5 : training_point_index , : ) ;

%  which must be normalised using the same data_mean and data_std used to normalise the training data
init_data = ( init_data - repmat( data_mean_1 , size( init_data , 1 ) , 1 ) ) ./ repmat( data_std_1 , size( init_data , 1 ) , 1 ) ; 

%  and we also need the 3 fixed hidden units of the crbm's first hidden layer
[ hidden_1 ] = get_fixed_hidden_layer( init_data , numhid1 , w11 , bj11 , B11 , n1 , n2 , gsd ) ;

%  do Gibbs sampling and generate the visible layer from the above, clamped, init_data and hidden_1 layers
[ visible_data ] = generate_visible_data( init_data , hidden_1 , numhid1 , numhid2 , w11 , w21 , bj11 , bi11 , bj21 , bi21 , A11 , A21 , B11 , B21 , n1 , n2 , gsd ) ;

%  now post process the visible_data - scale back to original using data_mean & data_std from above
visible_data_1 = data_std_1 .* visible_data .+ data_mean_1 ;

%  repeat for 2nd bar
init_data = [ features( training_point_index - 4 : training_point_index , : ) ; visible_data_1 ] ;

%  which must be normalised using the same data_mean and data_std used to normalise the training data
init_data = ( init_data - repmat( data_mean_2 , size( init_data , 1 ) , 1 ) ) ./ repmat( data_std_2 , size( init_data , 1 ) , 1 ) ; 

%  and we also need the 3 fixed hidden units of the crbm's first hidden layer
[ hidden_1 ] = get_fixed_hidden_layer( init_data , numhid1 , w12 , bj12 , B12 , n1 , n2 , gsd ) ;

%  do Gibbs sampling and generate the visible layer from the above, clamped, init_data and hidden_1 layers
[ visible_data ] = generate_visible_data( init_data , hidden_1 , numhid1 , numhid2 , w12 , w22 , bj12 , bi12 , bj22 , bi22 , A12 , A22 , B12 , B22 , n1 , n2 , gsd ) ;

%  now post process the visible_data - scale back to original using data_mean & data_std from above
visible_data_2 = data_std_2 .* visible_data .+ data_mean_2 ;

%  repeat for 3rd bar
init_data = [ features( training_point_index - 3 : training_point_index , : ) ; visible_data_1 ; visible_data_2 ] ;

%  which must be normalised using the same data_mean and data_std used to normalise the training data
init_data = ( init_data - repmat( data_mean_3 , size( init_data , 1 ) , 1 ) ) ./ repmat( data_std_3 , size( init_data , 1 ) , 1 ) ; 

%  and we also need the 3 fixed hidden units of the crbm's first hidden layer
[ hidden_1 ] = get_fixed_hidden_layer( init_data , numhid1 , w13 , bj13 , B13 , n1 , n2 , gsd ) ;

%  do Gibbs sampling and generate the visible layer from the above, clamped, init_data and hidden_1 layers
[ visible_data ] = generate_visible_data( init_data , hidden_1 , numhid1 , numhid2 , w13 , w23 , bj13 , bi13 , bj23 , bi23 , A13 , A23 , B13 , B23 , n1 , n2 , gsd ) ;

%  now post process the visible_data - scale back to original using data_mean & data_std from above
visible_data_3 = data_std_3 .* visible_data .+ data_mean_3 ;

%  create new bars for plotting
new_opens = zeros( 3 , 1 ) ; new_highs = zeros( 3 , 1 ) ; new_lows = zeros( 3 , 1 ) ; new_closes = zeros( 3 , 1 ) ;

visible_data_1 = exp( visible_data_1 ) ; visible_data_2 = exp( visible_data_2 ) ; visible_data_3 = exp( visible_data_3 ) ;
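
%  invert the log-ratio features: e.g. high_from_vwap = log( high ./ vwap ) means
%  high = exp( high_from_vwap ) * the previous bar's vwap; the new open is taken as
%  the average of its two available estimates, one from the open_from_vwap feature
%  and one from the open_from_previous_close feature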

new_opens( 1 , 1 ) = ( visible_data_1( 1 , 1 ) * vwap(training_point_index) + visible_data_1( 1 , 2 ) * close(training_point_index) ) / 2 ;
new_highs( 1 , 1 ) = visible_data_1( 1 , 3 ) * vwap(training_point_index) ;
new_lows( 1 , 1 ) = visible_data_1( 1 , 4 ) * vwap(training_point_index) ;
new_closes( 1 , 1 ) = visible_data_1( 1 , 5 ) * vwap(training_point_index) ;
vwap_1 = ( new_opens(1,1) + new_closes(1,1) + ( new_highs(1,1) + new_lows(1,1) )/2 ) / 3 ;

new_opens( 2 , 1 ) = ( visible_data_2( 1 , 1 ) * vwap_1 + visible_data_2( 1 , 2 ) * new_closes(1,1) ) / 2 ;
new_highs( 2 , 1 ) = visible_data_2( 1 , 3 ) * vwap_1 ;
new_lows( 2 , 1 ) = visible_data_2( 1 , 4 ) * vwap_1 ;
new_closes( 2 , 1 ) = visible_data_2( 1 , 5 ) * vwap_1 ;
vwap_2 = ( new_opens(2,1) + new_closes(2,1) + ( new_highs(2,1) + new_lows(2,1) )/2 ) / 3 ;

new_opens( 3 , 1 ) = ( visible_data_3( 1 , 1 ) * vwap_2 + visible_data_3( 1 , 2 ) * new_closes(2,1) ) / 2 ;
new_highs( 3 , 1 ) = visible_data_3( 1 , 3 ) * vwap_2 ;
new_lows( 3 , 1 ) = visible_data_3( 1 , 4 ) * vwap_2 ;
new_closes( 3 , 1 ) = visible_data_3( 1 , 5 ) * vwap_2 ;

figure(1) ;
plot_opens = open( training_point_index-20:training_point_index+3 , : ) ;
plot_highs = high( training_point_index-20:training_point_index+3 , : ) ;
plot_lows = low( training_point_index-20:training_point_index+3 , : ) ;
plot_closes = close( training_point_index-20:training_point_index+3 , : ) ;
pkg load financial ;
clf() ;
highlow( plot_highs , plot_lows , plot_closes , plot_opens )

figure(2) ;
plot_opens = [ open( training_point_index-20:training_point_index , : ) ; new_opens ] ;
plot_highs = [ high( training_point_index-20:training_point_index , : ) ; new_highs ] ;
plot_lows = [ low( training_point_index-20:training_point_index , : ) ; new_lows ] ;
plot_closes = [ close( training_point_index-20:training_point_index , : ) ; new_closes ] ;
clf() ;
highlow( plot_highs , plot_lows , plot_closes , plot_opens )
which produces two charts as output: the first shows the actual price bars, and the second shows the same chart with the last 3 bars replaced by their CRBM-modelled versions, i.e. the last 3 bars on the right of the second chart are the modelled counterparts of the 3 rightmost bars in the first chart. At the moment the modelling isn't very good because I'm using simplistic features - this is just proof-of-concept coding. Future work will include finding much better input features, speeding up the running time of the code by compiling it into .oct files, and then optimising parameters using my particle swarm optimisation code.