I'll try to put some intuition behind it, along with the TF approach.
General intuition:
The regression presented here is a supervised learning problem. As defined by Russell & Norvig in Artificial Intelligence: A Modern Approach, the task is:
given a training set of m input-output pairs (x1, y1), (x2, y2), ..., (xm, ym), where each output was generated by an unknown function y = f(x), discover a function h that approximates the true function f.
For that, the hypothesis function h somehow combines each x with the parameters to be learned, so that its output is as close as possible to the corresponding y, and this for the whole dataset. The hope is that the resulting function will be close to f.
But how do we know those parameters? In order to find them, the model must be able to evaluate itself. Here comes the cost function (also called loss, energy, merit...): a metric function that compares the output of h with the corresponding y, penalizing big deviations.
Now it should be clear what the "learning process" actually is here: altering the parameters to achieve a lower value for the cost function.
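To make these pieces concrete (hypothesis, cost, and how parameter choice affects the cost), here is a minimal NumPy sketch of my own; the names `h` and `cost` are illustrative, not taken from the posted code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the unknown true function is f(x) = 2x + 1,
# observed only through (x, y) pairs.
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0

def h(x, w, b):
    # Hypothesis: our parameterized guess at f
    return w * x + b

def cost(w, b):
    # Mean squared error: compares h's output with y, punishing big deviations
    return np.mean((h(x, w, b) - y) ** 2)

# "Learning" means moving the parameters toward lower cost:
print(cost(0.0, 0.0))  # a bad guess has high cost
print(cost(2.0, 1.0))  # the true parameters have (here) zero cost
```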
Linear Regression:
The example you posted performs a parametric linear regression, optimized with gradient descent, using the mean squared error as cost function. Which means:
Parametric: the set of parameters is fixed. They are held in the exact same memory placeholders throughout the learning process.
Linear: the output of h is merely a linear (actually, affine) combination of the input x and your parameters. So if x and w are real-valued vectors of the same dimensionality, and b is a real number, it holds that h(x, w, b) = w.transposed()*x + b. Page 107 of the Deep Learning Book brings more quality insights and intuitions on that.
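As a quick sanity check of that formula, here is the affine combination spelled out in NumPy (the concrete values are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # input vector
w = np.array([0.5, -1.0, 2.0])  # parameter vector, same dimensionality as x
b = 0.25                        # real-valued bias (what makes it affine, not purely linear)

# h(x, w, b) = w^T x + b; for 1-D arrays this is just a dot product plus a scalar
h = w.T @ x + b
print(h)  # 0.5*1 + (-1)*2 + 2*3 + 0.25 = 4.75
```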
Cost function: now this is the interesting part. The mean squared error is a convex function. This means it has a single, global optimum, and furthermore, that optimum can be found directly with the set of normal equations (also explained in the DLB). In the case of your example, the stochastic (and/or minibatch) gradient descent method is used instead: this is the preferred method when optimizing non-convex cost functions (which is the case in more advanced models like neural networks) or when your dataset has huge dimensionality (also explained in the DLB).
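For reference, solving the convex MSE problem directly via the normal equations w* = (X^T X)^{-1} X^T y looks like this in NumPy (a sketch on synthetic, noiseless data of my own making; `np.linalg.solve` is preferred over explicitly inverting the matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3                       # m samples, d features
X = rng.normal(size=(m, d))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 0.7
y = X @ true_w + true_b             # outputs generated by the "unknown" f

# Append a column of ones so the bias b is absorbed into the weight vector
Xb = np.hstack([X, np.ones((m, 1))])

# Normal equations: (X^T X) w* = X^T y
w_star = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_star)  # recovers [1.5, -2.0, 0.5, 0.7] up to numerical precision
```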
Gradient descent: tf deals with this for you, so it is enough to say that GD minimizes the cost function by following its derivative "downwards", in small steps, until it reaches a stationary point. If you totally need to know, the exact technique applied by TF is called automatic differentiation, a kind of compromise between the numeric and symbolic approaches. For convex functions like yours this point will be the global optimum, and (if your learning rate is not too big) it will always converge to it, so it doesn't matter which values you initialize your Variables with. Random initialization is necessary in more complex architectures like neural networks. There is some extra code regarding the management of the minibatches, but I won't get into that because it is not the main focus of your question.
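To illustrate the convexity point, here is a hand-rolled gradient descent loop in NumPy (my own sketch, not TF's automatic differentiation): the MSE derivatives are written out explicitly, and convergence does not depend on the deliberately bad initial values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x - 0.5                    # true parameters: w=3.0, b=-0.5

w, b = 10.0, -7.0                    # deliberately bad initialization
lr = 0.1                             # learning rate: too big and GD diverges

for _ in range(2000):
    err = (w * x + b) - y            # residuals of the hypothesis
    w -= lr * 2 * np.mean(err * x)   # d(MSE)/dw
    b -= lr * 2 * np.mean(err)       # d(MSE)/db

print(w, b)  # converges near (3.0, -0.5) regardless of the starting point
```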
TensorFlow Approach:
Deep Learning frameworks are nowadays largely about building computational graphs (you may want to take a look at a presentation on DL frameworks I gave some weeks ago). To construct and run a TensorFlow graph, a declarative style is followed, which means that the graph has to be first completely defined and compiled before it is deployed and executed. It is very recommendable to read this short wiki article if you haven't done so yet. In this context, the setup is split in two parts:
First, you define your computational Graph, where you put your dataset and parameters in memory placeholders, define the hypothesis and cost functions built upon them, and tell tf which optimization technique to apply.
Then you run the computation in a Session, where the library is able to (re)load the data into the placeholders and perform the optimization.
Code:
The sample code follows this approach:
Define the training data x and labels y, and prepare a placeholder in the Graph for each of them (fed in via the feed_dict part).
Define the 'W' and 'b' variables for the parameters. They have to be Variables because they will be updated during the Session.
Define pred (our hypothesis) and cost , as explained earlier.
From this, the rest of the code should be clearer. Regarding the optimizer, as I said, tf already knows how to deal with that, but you may want to look into gradient descent for more details (again, the DLB is a pretty good reference for that).
Cheers, Andres
CODE EXAMPLES: GRADIENT DESCENT VS. NORMAL EQUATIONS
These small snippets generate simple multi-dimensional datasets and test both approaches. Notice that the normal equations approach requires no looping and brings better results. For small dimensionality (DIMENSIONS < 30k) it is probably the preferred approach:
from __future__ import absolute_import, division, print_function

import numpy as np
import tensorflow as tf

# GLOBALS