A Structured Regularization Approach
Report
Zhe Cheng
Instructor: Dr. Ikhlas Abdel-Qader
Contents
1. Understanding Dropout
2. DML Using Dropout
3. Applying Dropout to the Distance Metric
4. Applying Dropout to the Training Data
5. Conclusion

1. Dropout
Dropout prevents overfitting and provides a way of efficiently approximating the combination of exponentially many different neural network architectures. The term "dropout" refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all of its incoming and outgoing connections, as shown in Figure 1, in which the dropped units are selected at random. In the simplest case, each unit is retained with a fixed probability p, independent of the other units.
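As a minimal sketch of the basic mechanism described above (inverted dropout with a keep probability p_keep; the function and variable names are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_keep=0.8):
    """Inverted dropout: keep each unit independently with probability
    p_keep and rescale by 1/p_keep so the expected activation is unchanged."""
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

h = np.ones(10)                  # toy hidden-layer activations
h_dropped = dropout(h)           # each entry is now either 0 or 1/0.8 = 1.25
```

The 1/p_keep rescaling is what lets the same network be used at test time without dropout, since the expected activation matches the training-time average.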
We will discuss the details in the remainder of this section. In particular, we will discuss two different applications of dropout: applying dropout to the learned distance metric, and applying dropout to the training data, in the following two subsections.
3. Applying Dropout to the Distance Metric
In this section, we focus on applying dropout to the learned metric. Let M_{t-1} be the metric learned from the previous iteration. To apply the dropout technique, we introduce a Bernoulli random matrix B^t, where each B^t_{i,j} is a Bernoulli random variable with Pr(B^t_{i,j} = 1) = p_{i,j}. Using the random matrix B^t, we compute the dropped-out distance metric, denoted by M̂^t, as M̂^t_{i,j} = B^t_{i,j} [M_{t-1}]_{i,j}.
Here B^t_{i,j} = 1 when i = j, and B^t_{i,j} = B^t_{j,i} otherwise, since we already know that M is a symmetric matrix and the dropped-out metric should remain symmetric. With different designs of the sampling probabilities, we can apply dropout to the learned metric to simulate the effect of L1 regularization. In particular, we introduce a data-dependent dropout probability for each element.
Now, instead of perturbing M_{t-1}, we apply dropout to M'_t, i.e., the matrix obtained after the gradient mapping. It is easy to verify that the expectation of the perturbed matrix is given element-wise by the keep probabilities: E[M̂^t_{i,j}] = p_{i,j} [M'_t]_{i,j}.
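A minimal sketch of this step, assuming an element-wise symmetric Bernoulli mask with the diagonal always kept, so that each off-diagonal element is scaled by its keep probability in expectation (function and variable names are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_metric(M, p=0.8):
    """Element-wise dropout of a learned metric M: sample a symmetric
    Bernoulli mask B (B_ij = B_ji, diagonal always kept) so that
    M_hat = B * M remains symmetric."""
    d = M.shape[0]
    upper = (rng.random((d, d)) < p).astype(float)
    B = np.triu(upper, k=1)            # strict upper triangle
    B = B + B.T + np.eye(d)            # mirror it, and keep the diagonal
    return B * M

M = np.array([[2.0, 0.5, 0.1],
              [0.5, 1.5, 0.3],
              [0.1, 0.3, 1.0]])
p = 0.8
M_hat = dropout_metric(M, p)

# Expectation check: E[M_hat] = p * M off the diagonal, M on the diagonal.
avg = np.mean([dropout_metric(M, p) for _ in range(20000)], axis=0)
expected = p * M + (1 - p) * np.diag(np.diag(M))
```

Sampling only the upper triangle and mirroring it is one simple way to enforce B_ij = B_ji, so the perturbed matrix stays a valid symmetric metric candidate.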
The corresponding keep probability for each element is then determined. Next, we consider dropout with a probability based on the magnitudes of the individual elements of the metric.
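The explicit probability did not survive in the text above; one common choice that makes dropout reproduce L1 regularization (soft-thresholding) in expectation is p_{i,j} = max(0, 1 − λ/|M_{i,j}|), since then E[B_{i,j} M_{i,j}] equals the soft-thresholded element. A hypothetical Monte Carlo check of that assumption for a single element (λ and the element value are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

lam = 0.2          # hypothetical regularization parameter
m = 0.5            # one off-diagonal element of the metric after the gradient step

# Assumed element-based keep probability p = max(0, 1 - lam/|m|), chosen so
# that the dropped-out element equals soft-thresholding in expectation.
p = max(0.0, 1.0 - lam / abs(m))

dropped = (rng.random(200_000) < p) * m            # B * m over many draws
soft_threshold = np.sign(m) * max(0.0, abs(m) - lam)
# dropped.mean() should approach soft_threshold (0.3 for these values).
```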
4. Applying Dropout to the Training Data
While Gaussian noise could simulate the effect of the trace-norm regularizer, the external noise it introduces may affect the solution. Therefore, we instead consider applying dropout to the training data, multiplying each feature by δ^t_k / p, where δ^t_k ∈ {0, 1} is a binary random variable with Pr(δ^t_k = 1) = p.
Note that when we perform dropout on the training data according to this strategy, we actually drop the rows and the corresponding columns in the first component (x^t_i − x^t_j)(x^t_i − x^t_j)^T of A^t. Since the expectation of the squared random variables on the diagonal is 1 plus their variance, while the expectation of the cross terms off the diagonal is 1, the expectation of Â^t differs from A^t only in its diagonal elements.
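Assuming the inverted-dropout scaling δ_k / p on each feature (the variable names are illustrative), a small simulation shows how the expectation of the outer product (x_i − x_j)(x_i − x_j)^T changes: the off-diagonal entries are preserved while the diagonal is inflated.

```python
import numpy as np

rng = np.random.default_rng(3)

p = 0.8
z = np.array([1.0, -2.0, 0.5])      # z = x_i - x_j for one training pair
A = np.outer(z, z)                  # the component (x_i - x_j)(x_i - x_j)^T

# Dropout on the training data: multiply each coordinate by an
# independent Bernoulli delta_k, rescaled by 1/p (inverted dropout).
n = 100_000
deltas = (rng.random((n, z.size)) < p) / p
Z = deltas * z                              # n dropped-out copies of z
A_hat_mean = np.einsum('ni,nj->nij', Z, Z).mean(axis=0)

# Off-diagonal: E[(d_k/p)(d_l/p)] = 1, so those entries match A.
# Diagonal:     E[(d_k/p)^2] = 1/p, so the diagonal is inflated by 1/p.
```

The extra mass on the diagonal is the regularization effect: in expectation, dropout on the data adds a data-dependent diagonal perturbation to A^t rather than external noise.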