## Restricted Boltzmann Machine

A Restricted Boltzmann Machine (RBM) is an undirected graphical model that plays a major role in the deep learning framework. It was introduced by Paul Smolensky in 1986 under the name Harmonium. Many hidden layers can be learned efficiently by composing restricted Boltzmann machines, using the feature activations of one as the training data for the next. RBMs are neural networks that belong to the so-called energy-based models. They are used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling.

## Autoencoders vs. Restricted Boltzmann Machine

Autoencoders are simply neural networks comprising three layers, in which the output layer is connected back to the input layer. An autoencoder has far fewer hidden units than visible units, and it is trained to minimize the reconstruction error. In simple words, training helps discover an efficient representation of the input data. An RBM shares a similar idea, but instead of deterministic units it uses stochastic units with a particular distribution, and it trains the model to capture the association between two sets of variables. One of the most important aspects that distinguishes an RBM from an autoencoder is that the RBM has two biases: the hidden bias helps the RBM produce the activations on the forward pass, while the visible-layer bias helps the RBM learn the reconstruction on the backward pass.

## Layers in Restricted Boltzmann Machine

Restricted Boltzmann Machines are shallow; they are two-layer neural nets that constitute the building blocks of deep belief networks. The first layer of an RBM is the input layer, also known as the visible layer, and the second layer is the hidden layer.
Each node represents a neuron-like unit, and the nodes are interconnected across the two layers. However, no two nodes of the same layer are linked, which means there is no intralayer communication; this is the only restriction in the restricted Boltzmann machine. At each node, a calculation takes place that processes the inputs and makes a stochastic decision about whether to transmit the input or not.

## Working of Restricted Boltzmann Machine

Each visible node takes a low-level feature from an item in the dataset to be learned; for example, from a dataset of grayscale images, each visible node would receive one pixel value for each pixel in one image. Let's follow a single pixel value x through the two-layer net, and look at how the different inputs get combined at one particular hidden node. Each x gets multiplied by a distinct weight; the products are summed, added to the bias, and the result is passed through the activation function to produce the output of that node. In other words, each input x is multiplied by an individual weight w at each hidden node, so a single input would encounter three weights, which results in a total of 12 weights (4 input nodes x 3 hidden nodes). The weights between the two layers always form a matrix whose rows equal the number of input nodes and whose columns equal the number of hidden nodes. Each hidden node thus receives four inputs multiplied by their separate weights; these products are summed, added to the bias, and the result is passed through the activation function to produce one output for each hidden node.
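This forward computation can be sketched in a few lines of NumPy. The 4 x 3 shapes and the sigmoid activation follow the example above; the actual input and weight values here are illustrative placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(4)                 # one low-level feature per visible node
W = rng.standard_normal((4, 3))   # 12 weights = 4 input nodes x 3 hidden nodes
a = np.zeros(3)                   # one bias per hidden node

hidden_out = sigmoid(x @ W + a)   # one output per hidden node
```

Each entry of `hidden_out` is the activation of one hidden node, squashed between 0 and 1 by the sigmoid.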
## Training of Restricted Boltzmann Machine

The training of a Restricted Boltzmann Machine is quite different from the training of a neural network via stochastic gradient descent. Following are the two main training steps:

- Gibbs Sampling
Gibbs sampling is the first part of the training. Given an input vector v, we sample the hidden nodes from the probabilities p(h | v), and then use that hidden sample to predict new visible values from p(v | h). This process is repeated numerous times (k times), such that after the k iterations we obtain another input vector v_k, which is a recreation of the original input.

- Contrastive Divergence Step
During the contrastive divergence step, the weight matrix gets updated. Using the activation probabilities for the hidden values, p(h | v_0) and p(h | v_k), the update matrix is calculated as the difference between the outer products of these probabilities with the input vectors v_0 and v_k:

ΔW = v_0 ⊗ p(h | v_0) − v_k ⊗ p(h | v_k)

With the help of this update matrix, the new weights can be obtained with a gradient step given by the following equation:

W_new = W_old + lr · ΔW

## Training to Prediction
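A minimal NumPy sketch of this update; the shapes (4 visible, 3 hidden) and the learning rate `lr` are illustrative assumptions:

```python
import numpy as np

def cd_update(W, v0, ph0, vk, phk, lr=0.1):
    # difference of outer products: v0 ⊗ p(h|v0) - vk ⊗ p(h|vk)
    dW = np.outer(v0, ph0) - np.outer(vk, phk)
    return W + lr * dW  # gradient step on the weights

W = np.zeros((4, 3))                  # 4 visible x 3 hidden weights
v0 = np.array([1.0, 0.0, 1.0, 1.0])   # original input
vk = np.array([1.0, 0.0, 0.0, 1.0])   # input recreated after k Gibbs steps
ph0 = np.array([0.9, 0.2, 0.7])       # p(h | v0)
phk = np.array([0.6, 0.3, 0.5])       # p(h | vk)
W_new = cd_update(W, v0, ph0, vk, phk)
```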
## Building a Restricted Boltzmann Machine

We are going to implement our Restricted Boltzmann Machine with PyTorch, which is a highly advanced deep learning and AI platform. We have to make sure that PyTorch is installed on our machine, and to do that, follow the steps below.

For Windows users: Click on the Windows button in the lower-left corner -> List of programs -> Anaconda -> Anaconda prompt. Inside the Anaconda prompt, run the PyTorch installation command. The prompt asks whether to proceed or not; confirm it with y, after which the library installs successfully.

After this, we will move on to build our two recommender systems: one that predicts whether the user will like a movie (yes/no), and another that predicts the rating of a movie by a user. So, the first one predicts a binary outcome, 1 or 0, i.e., yes or no, and the second one predicts a rating from 1 to 5. These are the kinds of recommender systems most used in industry: many companies build recommender systems that either predict whether a user or customer will like a product, or predict the rating or review a user will give to certain products. So, in this topic, we will create the recommender system that predicts a binary outcome, yes or no, with our restricted Boltzmann machine, and in the next topic we will implement the other recommender system, an autoencoder, which predicts the rating from 1 to 5. For both of these recommender systems, we will use the same real-world dataset, the MovieLens dataset. You can download the dataset by clicking on the link https://grouplens.org/datasets/movielens/, which will direct you to the official website.
This dataset was created by the GroupLens research group. Next, we will import the libraries that we will be using to build our Restricted Boltzmann Machine. Since we will be working with arrays, we will import NumPy, as well as pandas to import the datasets and the torch libraries to build the network. After importing all the libraries, classes, and functions, we will now import our dataset. The first dataset that we are going to import contains all the movies, which we import with pandas, passing the following arguments: - The first argument is the path that contains the dataset. Here the first element of the path is
**ml-1m**, followed by the name of the file, which is **movies.dat**. - The second argument is the separator, and the default separator is the comma, which works for CSV files where the features are separated by commas. Since some of the movie titles contain a comma, we cannot use the comma as a separator, because then we could have the same movie split across two different columns. Therefore, the separator is not a comma but the double colon, i.e., "
**::**". - Then the third argument is the header, because the file movies.dat doesn't contain a header, i.e., the names of the columns. We need to specify this because the default value of the header parameter is not None but 'infer'; since there are no column names, we will put
**header = None**. - The next parameter is the engine, which is to make sure that the dataset gets imported correctly, so we will use the
**python** engine to make the import work correctly. - Lastly, we need to input the last argument, which is the encoding; we need a different encoding than usual because some of the movie titles contain special characters that cannot be treated properly with the classic encoding, UTF-8. So, we will input
**latin-1** due to some of the special characters in the movie titles.
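Putting these arguments together, the import might look like this (assuming the ml-1m folder sits in the working directory; it is wrapped in a small helper so the snippet runs on its own):

```python
import pandas as pd

def load_movies(path='ml-1m/movies.dat'):
    # movies.dat has no header row; '::' separates the columns; the python
    # engine handles the multi-character separator, and latin-1 handles the
    # special characters in some movie titles
    return pd.read_csv(path, sep='::', header=None,
                       engine='python', encoding='latin-1')
```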
After executing the above line, we will get the list of all the movies in the **movies** variable. Next, in the same way, we will import the user dataset: we will create a new variable, **users**, importing the file users.dat with the same arguments.
Looking at the imported data, we can see all the different information about the users: the first column is the user ID, the second column is the gender, the third column is the age, the fourth column is a code that corresponds to the user's job, and the fifth column is the zip code. Now we will import the ratings, which we will do as before: we create a new variable, **ratings**, and import the file ratings.dat with the same arguments.
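The users and ratings files follow the same pattern as the movies file, so the imports might be sketched with one shared helper (the actual loads are commented out because they assume the ml-1m folder is present):

```python
import pandas as pd

def load_dat(path):
    # same arguments as for movies.dat: '::' separator, no header row,
    # python engine, latin-1 encoding for special characters
    return pd.read_csv(path, sep='::', header=None,
                       engine='python', encoding='latin-1')

# users.dat columns: user ID, gender, age, job code, zip code
# ratings.dat columns: user ID, movie ID, rating (1-5), timestamp
# users = load_dat('ml-1m/users.dat')
# ratings = load_dat('ml-1m/ratings.dat')
```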
After executing the above line of code, we can see that we have successfully imported our ratings variable. Here the first column corresponds to the users, such that all the 1's correspond to the same user. The second column corresponds to the movies, and the numbers shown there are the movie IDs contained in the movies DataFrame. The third column corresponds to the ratings, which go from 1 to 5. And the last column contains the timestamps that specify when each user rated the movie.

Next, we will prepare the training set and the test set, for which we will create a variable **training_set** by importing the file u1.base from the ml-100k folder, which comes with ready-made train/test splits. As we already saw, the whole original dataset in ml-100k contains 100,000 ratings, and since each observation corresponds to one rating, after executing the line of code we have 80,000 ratings in the training set; the split is therefore 80%-20%. We can check the
training_set, and we can see that it has exactly the same structure as the ratings dataset we imported earlier: the first column corresponds to the users, the second to the movies, the third to the ratings, and the fourth to the timestamps, which we really don't need because they won't be relevant for training the model. The training_set is imported as a DataFrame, which we have to convert into an array, because later on in this topic we will be using PyTorch tensors, and for that, we need an array instead of a DataFrame. So we will take our training_set variable again and convert it into an array of integers; after executing this line, we can check the
training_set again, and it can be seen that we got the same values, but this time in an array. Now, in the same way, we will prepare the test_set, which will be quite easy this time because we will use the same technique to import and convert it into an array. We will use exactly the same code; all we have to do is replace the training_set with the **test_set** and the file u1.base with u1.test. After executing this line, we will get our **test_set** DataFrame.
Since our test_set is again a DataFrame, we need to convert it into an array, and we will do it in the same way as we did for the training_set. After running this line of code, our test_set becomes an array of integers, which we can check in the **test_set** variable.
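The import-and-convert step for both splits can be sketched as one helper; the tab separator and the user/movie/rating/timestamp column layout are the ml-100k conventions assumed here, and the real loads are commented out because they need the ml-100k folder:

```python
import numpy as np
import pandas as pd

def load_split(path):
    # u1.base / u1.test are tab-separated with no header row:
    # user ID, movie ID, rating, timestamp
    df = pd.read_csv(path, delimiter='\t', header=None)
    return np.array(df, dtype='int')

# training_set = load_split('ml-100k/u1.base')  # the 80,000 training ratings
# test_set = load_split('ml-100k/u1.test')      # the 20,000 test ratings
```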
It can be seen that we got the same values, but this time in an array. In the next step, we are going to get the total number of users and movies, because in the further steps we will be converting our training_set and test_set into matrices with users in lines and movies in columns.

Therefore, in order to get the total number of users and the total number of movies, we will take the maximum of the maximum user ID in the training_set and in the test_set, and likewise for the movie IDs; this will further help us in making the matrix of users in lines and movies in columns. To do this, we will make two new variables, **nb_users** and **nb_movies**. For the users, we take the maximum of the user-ID column (the first column, index 0) over both the training_set and the test_set; by executing this line, we get the total number of user IDs, 943. For the movies, we use the same code but replace the index of the users column, 0, by the index of the movies column, 1; by executing this line, we get the total number of movie IDs, 1682.

Now we will convert our training_set and test_set into matrices with users in lines and movies in columns, because we need to make a specific structure of data that corresponds to what the restricted Boltzmann machine expects as input. The restricted Boltzmann machine is a type of neural network with input nodes that are the features, and the observations go one by one into the network, starting with the input nodes. So, we will create a structure that contains these observations, whose different features go into the input nodes. Basically, we are just making the usual structure of data for neural networks, or even for machine learning in general: observations in lines and features in columns, which is exactly the structure of data expected by the neural network.
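The user and movie counts described above can be computed as follows; the tiny arrays here are hypothetical stand-ins for the real training_set and test_set:

```python
import numpy as np

# toy stand-ins: rows of (user ID, movie ID, rating, timestamp)
training_set = np.array([[1, 1, 5, 0], [2, 3, 4, 0]])
test_set = np.array([[3, 2, 3, 0]])

# user IDs are in column 0, movie IDs in column 1; take the maximum
# over both splits so no ID is missed
nb_users = int(max(max(training_set[:, 0]), max(test_set[:, 0])))
nb_movies = int(max(max(training_set[:, 1]), max(test_set[:, 1])))
```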
Thus, we will convert our data into such a structure, and since we are going to do this for both the training_set and the test_set, we will create a function that we will apply to both of them separately. In order to create a function in Python, we start with **def**, followed by the name of the function, which we will call **convert**, and its argument, the data to convert.

Next, we will create a list of lists, which means we will be creating several lists, one list for each line/user. Since we have 943 users, we will create 943 lists, where each list contains the ratings given by that user to each movie. We will call this list of lists **new_data** and initialize it as an empty list. Since new_data has to contain one list per user, we will build it with a for loop: we introduce a local variable, **id_users**, that loops over all the user IDs of the data, i.e., of the training_set or the test_set.

Now, inside the loop, we will create the first list of this new_data list, which contains the ratings of the first user, because the user IDs start at 1. We take our data, which is assumed to be the training_set (we will then apply convert to the training_set and afterwards to the test_set), and from it we first take the column that contains all the movie IDs, i.e., the second column, of index 1. Since we only want the movie IDs of the first user, because we are at the beginning of the loop, we add a condition in a new pair of brackets []: inside this bracket, we put the condition that the first column of the data, the user column, equals the current user ID, which is 1 for the first user. Now, in the same way, we will get all the ratings of that same first user.
Instead of taking id_movies, we will take **id_ratings**, i.e., the third column of the data under the same condition, which gives all the ratings of that same user. After this, we will build the full list of 1682 ratings for this user, including zeros for the movies he or she didn't rate. So, we start by initializing a list of 1682 zeros, which we will call **ratings**. After this, we replace the zeros by the real ratings for the movies that the user did rate; to do this, we use the movie IDs as indices into the ratings list, shifted by one because movie IDs start at 1 while Python indices start at 0, and assign the corresponding id_ratings. Now, we are left with only one thing to do, i.e., to append this list of ratings corresponding to one user to the huge list that contains the lists of all the different users; so, we append the **ratings** list to **new_data**. We are now done with our function, except that before moving ahead, we need to add the final line to return what we want, i.e., **new_data**.

The next step is to apply this function to the training_set as well as the test_set. By running this section of code, the training_set becomes a huge list containing 943 horizontal lists, where each of these lists contains the 1682 ratings of one user, including zeros for the movies that were not rated. Similarly, the test_set becomes a new list of lists with all the ratings inside, including the zeros.

Next, we will convert our training_set and test_set, which are so far lists of lists, into Torch tensors, so that our data has the format PyTorch expects. Thus, we take our training_set and pass it to the **torch.FloatTensor** class, which takes one argument, the list of lists. Similarly, we do for the test_set. After executing these two lines of code, the training_set and test_set variables will disappear from the variable explorer, but they are now converted into Torch tensors, and with this, we are done with the common data pre-processing for a recommender system. In the next step, we will convert the ratings into binary ratings, 0 or 1, because these are going to be the inputs of our restricted Boltzmann machine.
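A sketch of this conversion, with nb_users and nb_movies passed in as explicit parameters (the original code can read them as globals instead), and a tiny hypothetical dataset standing in for the real one:

```python
import numpy as np
import torch

def convert(data, nb_users, nb_movies):
    # one list of nb_movies ratings per user; 0 marks an unrated movie
    new_data = []
    for id_users in range(1, nb_users + 1):
        id_movies = data[:, 1][data[:, 0] == id_users]
        id_ratings = data[:, 2][data[:, 0] == id_users]
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings  # movie IDs start at 1, indices at 0
        new_data.append(list(ratings))
    return new_data

# toy stand-in: rows of (user ID, movie ID, rating, timestamp)
raw = np.array([[1, 1, 5, 0], [1, 3, 3, 0], [2, 2, 4, 0]])
training_set = torch.FloatTensor(convert(raw, nb_users=2, nb_movies=3))
```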
So, we will start with the ratings that are 0 in the original dataset, i.e., the movies that were not rated, which must now become -1 so that they are not counted as "not liked". Then we will do the other ratings as well, i.e., the ratings from 1 to 5, taking care of the ratings that we want to convert into zero, i.e., not liked. As we already discussed, the movies not liked by the user are the movies that were given one star or two stars. So, we set the ratings that were equal to one to 0, and, in the same way, the ratings that were equal to two in the original training_set become 0 as well. After this, we simply need to handle the movies that the users liked: the movies rated at least three stars, which means that three stars, four stars, and five stars all become 1. In order to access the three, four, and five stars at once, we replace == by >= in the condition. Then we do the same for the test_set. After executing this section of code, our inputs are ready to go into the RBM so that it can return the ratings of the movies that were not originally rated in the input vector, because this is unsupervised deep learning, and that's how it actually works.

Now we will create the architecture of the neural network, i.e., the architecture of the restricted Boltzmann machine. Basically, we will make several functions: one to initialize the RBM object that we will create, a second function, sample_h, to sample the hidden nodes, a third, sample_v, to sample the visible nodes, and finally a training function. So, we will start by defining the class, naming it **RBM**; its __init__ method takes three arguments: - The first one is the default argument
**self**, which corresponds to the object that will be created afterwards. It lets us specify that the variables we define are variables of the object that will be created, and not some global variables: all the variables attached to the object are created by putting self before the variable. - The second argument is
**nv**, which corresponds to the number of visible nodes. - Lastly, the third argument is
**nh**, which defines the number of hidden nodes.
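With these three arguments, the __init__ method might be sketched as follows; initializing with torch.randn, i.e., a standard normal distribution, is an assumption here, as the text only requires some random initialization:

```python
import torch

class RBM:
    def __init__(self, nv, nh):
        # weights of p(h|v): one row per hidden node, one column per visible node
        self.W = torch.randn(nh, nv)
        # bias of the hidden nodes; 2-D so it can be expanded over a mini-batch
        self.a = torch.randn(1, nh)
        # bias of the visible nodes
        self.b = torch.randn(1, nv)
```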
Since we want to initialize the weights and biases, we will go inside the function and initialize the parameters of our future objects, i.e., the objects that will be created from this class. Basically, inside the __init__ function, we will initialize all the parameters that we will optimize during the training of the RBM, i.e., the weights and the biases. And since these parameters are specific to the RBM model, i.e., to our future objects created from the RBM class, we need to declare them as variables of the object; therefore, each of them starts with self. We initialize the tensor of weights, then the bias **a** for the hidden nodes, and then our third parameter, which is still specific to the object that will be created: the bias **b** for the visible nodes.

Next, we will make the second function that we need for our RBM class, which is all about sampling the hidden nodes according to the probabilities p(h | v), where h is a hidden node and v is a visible node. During the training, we will approximate the log-likelihood gradient through Gibbs sampling, and to apply it, we need to compute the probabilities of the hidden nodes given the visible nodes. Once we have these probabilities, we can sample the activations of the hidden nodes. So, we will start by defining our function, calling it **sample_h**. Inside sample_h(), we will pass two arguments: - The first one is
**self**, which corresponds to the **object**, because to make the sample_h function work, we have to use the variables that we defined in __init__, and to take these variables, we need our object, which is identified by self. So, in order to access these variables, we are taking self here. - Then the second variable is
**x**, which corresponds to the **visible neurons** v in the probabilities p(h) given v.
Now, inside the function, we will first compute the probability of h given v, which is the probability that the hidden neurons equal one given the values of the visible neurons, i.e., the input vectors of observations with all the ratings. The probability of h given v is nothing but the sigmoid activation function applied to wx + a. We start by computing the product of the weights times the visible neurons, using torch.mm to make the product of the two tensors, x and the transpose of the weights. After this, we compute what goes inside the sigmoid activation function, which is wx plus the bias a, i.e., the linear function of the neurons where the coefficients are the weights and a is the hidden bias; we call wx + a the **activation**.

As said previously, each input vector will not be treated individually but inside batches, and even if a batch contains only one input vector, that input vector still resides in a batch; we call it a mini-batch. So, when we add the bias of the hidden nodes, we want to make sure that this bias is applied to each line of the mini-batch, and for this we use the **expand_as** function. Next, we compute the activation probability, **p_h_given_v**, by applying the sigmoid to the activation. In the last step, we return this probability as well as a sample of h, i.e., a sample of all the hidden neurons according to the probability p_h_given_v, obtained with a Bernoulli sampling.

So, we just implemented the sample_h function to sample the hidden nodes according to the probability p_h_given_v. We will now do the same for the visible nodes. In the end, we will output the predicted ratings, 0 or 1, of the movies that were not originally rated by the user, and these new ratings will be obtained from what we get in the hidden nodes, i.e., from the sample of the hidden nodes.
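Following that description, sample_h might look like this, written here as a standalone function that takes the weights W and hidden bias a explicitly so the example runs on its own:

```python
import torch

def sample_h(x, W, a):
    # x: mini-batch of visible states, shape (batch_size, nv)
    wx = torch.mm(x, W.t())              # (batch_size, nh)
    activation = wx + a.expand_as(wx)    # add the hidden bias to every row
    p_h_given_v = torch.sigmoid(activation)
    # Bernoulli sampling: each hidden node fires with its own probability
    return p_h_given_v, torch.bernoulli(p_h_given_v)
```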
Thus, we will make the function **sample_v**, which is exactly the same as the function above; we only need to replace a few things. First, we call the function sample_v, with an argument y corresponding to the hidden nodes. Here we will return p_v_given_h and a sample of the visible nodes, still based on Bernoulli sampling: we have our vector of probabilities of the visible nodes, and from this vector, we return a sampling of the visible nodes. Next, we change what's inside the activation function: we replace the variable x by y, the product uses the weights without the transpose, and we replace the bias a by b, the bias of the visible nodes.

Now we will make our last function, which is about the contrastive divergence that we will use to approximate the log-likelihood gradient. The RBM is an energy-based model, i.e., we have some energy function which we are trying to minimize, and since this energy function depends on the weights of the model, all the weights in the tensor of weights that we defined in the beginning, we need to optimize these weights to minimize the energy. Note that the RBM can not only be seen as an energy-based model but also as a probabilistic graphical model, where the goal is to maximize the log-likelihood of the training set. In order to minimize the energy, or equivalently to maximize the log-likelihood, we need to compute the gradient, as for any deep learning or machine learning model. However, the direct computation of the gradient is too heavy, so instead of computing it directly, we will rather approximate it with the help of contrastive divergence. So, we will again start by defining our new function, called **train**, which takes five arguments: - The first argument is
**self**, because we will update the tensor of weights and the biases a and b, which are variables specifically attached to the object. - The second argument is the input vector, which we will call
**v0**, which contains the ratings of all the movies by one user. - The third argument is
**vk**, which corresponds to the visible nodes obtained after k samplings, i.e., after k round trips from the visible nodes to the hidden nodes and back from the hidden nodes to the visible nodes. So, these are the visible nodes obtained after k iterations of contrastive divergence. - Then our fourth argument is
**ph0**, which is the vector of probabilities that, at the first iteration, the hidden nodes equal one given the values of v0, i.e., our input vector of observations. - Lastly, we will take our fifth argument, which is
**phk**, which corresponds to the probabilities of the hidden nodes after k samplings, given the values of the visible nodes vk.
After this, inside the function, we will take our tensor of weights and update it by adding the difference between two products: the product of the transpose of the input vector v0 with the probabilities ph0, minus the product of the transpose of vk with the probabilities phk. Next, we update the bias b, which is the bias of the probabilities p(v) given h, by adding the sum of the differences (v0 - vk). After this, we do our last update, the bias a that relates to the probabilities p(h) given v, by adding the sum of the differences (ph0 - phk).

Now we have our class, and we can use it to create several objects, i.e., several RBM models. We can test many of them with different configurations, i.e., with several numbers of hidden nodes, because that is our main parameter. But we can also add some more parameters to the class, like a learning rate, in order to improve and tune the model.

After executing the above sections of code, we are now ready to create our RBM object, for which we need two parameters, **nv** and **nh**. We define nv as the number of visible nodes, which is the number of movies, i.e., the length of one line of the training_set; and nh, the number of hidden nodes, which corresponds to the number of features we want to detect, for which 100 is a relevant first choice. Then we have another variable, the **batch_size**: instead of updating the weights after each observation, we update them after a batch of observations, and this additional parameter can also be tuned to try to improve the model in the end. In order to get fast training, we will create the batch_size variable with a value of 100. Now we create our RBM object, because we have the two required parameters of the __init__ method, nv and nh: we call the object **rbm** and pass it nv and nh.

Next, we will move on to training our restricted Boltzmann machine, for which we have to include, inside a for loop, the different functions that we made in the RBM class. We start by choosing a number of epochs, for which we create the variable **nb_epoch**, set to 10. After this, we make a for loop that goes through the 10 epochs. In each epoch, all our observations will go into the network, the weights will be updated after the observations of each batch have passed through the network, and then, in the end, we will get the final visible nodes with the new ratings for the movies that were not originally rated.
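The three updates of the train function, sketched over explicit tensors; the plain additions imply a learning rate of 1, which is what this sketch assumes, and the shapes below are a hypothetical mini-batch of 2 users, 4 movies, and 3 hidden nodes:

```python
import torch

def train_step(W, a, b, v0, vk, ph0, phk):
    # contrastive-divergence update: difference of the two products
    W += (torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)).t()
    b += torch.sum((v0 - vk), 0)    # bias of p(v|h)
    a += torch.sum((ph0 - phk), 0)  # bias of p(h|v)

W = torch.zeros(3, 4)               # nh x nv weights
a = torch.zeros(1, 3)               # hidden bias
b = torch.zeros(1, 4)               # visible bias
v0 = torch.FloatTensor([[1, 0, 1, 1], [0, 1, 1, 0]])  # original ratings
vk = torch.FloatTensor([[1, 0, 0, 1], [0, 1, 1, 0]])  # after k Gibbs steps
ph0 = torch.full((2, 3), 0.8)       # p(h|v0), constant here for illustration
phk = torch.full((2, 3), 0.5)       # p(h|vk)
train_step(W, a, b, v0, vk, ph0, phk)
```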
In order to make the for loop, we start by looping over the epochs, from 1 to nb_epoch. Then we go inside the loop and make the loss function to measure the error between the predictions and the real ratings. In this training, we compare the predictions to the ratings we already have, i.e., the ratings of the training_set: we measure the difference between the predicted ratings, either 0 or 1, and the real ratings, 0 or 1. For this RBM model, we will go with the simple difference in absolute value to measure the loss, so we introduce a loss variable, **train_loss**, initialized to 0. After this, we need a counter, **s**, because we are going to normalize the train_loss by dividing it by this counter; we initialize s to 0 as a float.

Next comes the real training, which happens with the three functions that we created in the above steps: sample_h, sample_v, and train. Everything we did when we made these functions was regarding one user, but of course, the samplings as well as the contrastive divergence algorithm have to be done over all the users in a batch. Therefore, we first get the batches of users, and in order to do that, we need another for loop inside the first one.

Before we move ahead, one important point is to be noted: we want to take batches of users. We don't want to take each user one by one and then update the weights; we want to update the weights after each batch of users has gone through the network. Since the batch_size equals 100, the first batch will contain all the users from index 0 to 99, the second batch will contain the users from index 100 to 199, the third batch the users from 200 to 299, and so on until the end.
So, the last batch that goes into the network will start at index 943 - 100 = 843, which means that the last batch will contain the users from index 843 onward. Hence the stop of the range for the users is not nb_users but **nb_users - batch_size**, with a step of batch_size.

Now we get inside the loop, and our first step is separating the input and the target, where the input is the ratings of all the movies by the users of the current batch, and the target is, at the beginning, the same as the input. Since the input goes into the Gibbs chain and will be updated to get the new ratings in each visible node, the input will change, but the target will remain the same. Therefore, we call the input **vk**, because it will become the output of the Gibbs sampling after the k steps of the random walk, and we take the slice of the training_set from the current user index to that index plus batch_size. Similarly, we define the target, **v0**, the batch of original ratings that we don't want to touch but want to compare in the end to our predicted ratings; we need it because we want to measure the error between the predicted ratings and the real ratings to get the loss, the train_loss.

Then we take **ph0**, the initial probabilities of the hidden nodes given the ratings of v0, from the sample_h function. In the next step, we add another for loop for the k steps of contrastive divergence: at each step, we sample the hidden nodes **hk** from the visible nodes vk with sample_h, and then update vk by sampling the visible nodes from hk with sample_v. In the next step, we will update the weights and the bias with the help of vk. But before moving ahead, we need to do one important thing: we will skip the cells that have -1 ratings in the training process by freezing the visible nodes that contain -1 ratings, because it would not be possible to update them during the Gibbs sampling.
In order to freeze the visible nodes containing the -1 ratings, we take vk and, for the positions where v0 is negative, i.e., the movies that were not rated, we reset vk to the original v0 values, so those nodes stay untouched by the Gibbs sampling. Next, we compute phk before applying the train function, taking the probabilities returned by sample_h applied to vk. Now we apply the train function, and since it doesn't return anything, we don't create any new variable; we simply call it on v0, vk, ph0, and phk.

Now the training will happen, and the weights and the biases will be updated towards the direction of the maximum likelihood; therefore, all our probabilities p(v) given the states of the hidden nodes will be more relevant. We will get the largest weights for the probabilities that are the most significant, which will eventually lead us to predicted ratings close to the real ratings.

Next, we update the **train_loss**: we measure the error with the simple distance in absolute value between the predictions and the real ratings, using the mean of the absolute difference between v0 and vk, taken only over the ratings that exist, i.e., where v0 >= 0. Then we update the counter s for normalizing the train_loss, incrementing it by 1 as a float. Lastly, we print what happens during training: the number of the epoch, to see where we are during the training, and for each epoch, the normalized loss, train_loss divided by s, to see how it is decreasing. So, we use the print function, starting with a string for the epoch, followed by the loss.

After executing this section of code, we can watch the normalized train_loss decrease over the 10 epochs, ending at a low value, which means the predicted ratings are mostly close to the real ones. Next, we will get the final results on the new observations of the test_set, for which we will adapt the training code: we replace the training_set with the **test_set**. Then, for the loop over all the users of the test_set, we will not include the batch_size, because it is just a technique specific to the training. It is a parameter that you can tune to get better or worse performance on the training_set and, therefore, on the test_set, but gathering the observations in batches is only for the training phase.
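Putting all of the pieces above together, the whole training phase might be sketched as follows. The RBM class here is a condensed version of the methods described in this section, and the toy training_set (300 users, 20 movies of -1/0/1 ratings) merely stands in for the real data:

```python
import torch

# condensed RBM with the three methods described in this section
class RBM:
    def __init__(self, nv, nh):
        self.W = torch.randn(nh, nv)
        self.a = torch.randn(1, nh)   # hidden bias
        self.b = torch.randn(1, nv)   # visible bias
    def sample_h(self, x):
        wx = torch.mm(x, self.W.t())
        p = torch.sigmoid(wx + self.a.expand_as(wx))
        return p, torch.bernoulli(p)
    def sample_v(self, y):
        wy = torch.mm(y, self.W)
        p = torch.sigmoid(wy + self.b.expand_as(wy))
        return p, torch.bernoulli(p)
    def train(self, v0, vk, ph0, phk):
        self.W += (torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)).t()
        self.b += torch.sum((v0 - vk), 0)
        self.a += torch.sum((ph0 - phk), 0)

# toy stand-in data: 300 users x 20 movies of -1/0/1 ratings
torch.manual_seed(0)
nb_users, batch_size = 300, 100
training_set = torch.randint(-1, 2, (nb_users, 20)).float()

rbm = RBM(nv=20, nh=5)
nb_epoch = 10
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.
    for id_user in range(0, nb_users - batch_size, batch_size):
        vk = training_set[id_user:id_user + batch_size]  # input of the Gibbs chain
        v0 = training_set[id_user:id_user + batch_size]  # fixed target
        ph0, _ = rbm.sample_h(v0)
        for k in range(10):                   # k steps of contrastive divergence
            _, hk = rbm.sample_h(vk)
            _, vk = rbm.sample_v(hk)
            vk[v0 < 0] = v0[v0 < 0]           # freeze the unrated (-1) cells
        phk, _ = rbm.sample_h(vk)
        rbm.train(v0, vk, ph0, phk)
        train_loss += torch.mean(torch.abs(v0[v0 >= 0] - vk[v0 >= 0]))
        s += 1.
    print('epoch: ' + str(epoch) + ' loss: ' + str(train_loss / s))
```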
Thus, we will remove everything that is related to the batch_size, and we will take the users up to the last user, because we will make predictions for each user one by one; we also remove the 0 because that's the default start of the range. Now, for each user, we go into the loop, and since we make the predictions for each user one by one, we replace the batch_size in the slices by 1, so that the input is the single line of ratings of that user.

Next, we will make one step, so that our prediction is directly the result of one round trip of Gibbs sampling, i.e., one step, one iteration of the blind walk. Here we simply remove the for loop, because we don't have to make 10 steps; we only have to make one single step: sampling the hidden nodes from the input, and then sampling the visible nodes from them. Since we only make one step of the blind walk, i.e., of the Gibbs sampling, we remove all the k's. Next, we replace the train_loss by the **test_loss**, and update it with the mean absolute difference between the target and the predicted visible nodes, taken over the ratings that exist in the test_set. Then we update the counter s in the same way. Lastly, we print the final test_loss, for which we get rid of all the epochs from the code, and in the string, we replace the loss by the test loss.
Thus, after executing the above line of code, we obtain our final test loss, which measures how well the restricted Boltzmann machine predicts the ratings of movies it never saw during training, and with that, our binary recommender system is complete.