Why are so many problems linear and how would one solve nonlinear problems?
I am taking a deep learning in Python class this semester and we are doing linear algebra.
Last lecture we "invented" linear regression with gradient descent from scratch (we did least squares the lecture before), where we talked about defining hypotheses, the loss function, the cost function, etc.
Why can many problems be looked at as linear and "just about trying to find a solution for the Equation Ax = b"?
That can be done with approaches like least squares or by training a neural network.
In the real world, most problems are not linear.
How would one tackle those problems, as linear algebra only applies to linear functions?
@EdM is, of course, correct. And those transformations are very, very flexible.
But let's take a really simple case. One independent variable, one dependent one. And a straight line for the fit (no transformations).
First, it's not a dichotomy between cases where this fits and where it doesn't. Sometimes, this simple straight line is a very good fit to the data; a lot of physics problems are like this. Sometimes this straight line is a terrible fit: Take anything which is sinusoidal, just as one case. If $y = \sin x$ then a straight line will not work at all.
More often, though, it's sort of an OK fit. Remember, as George Box said "all models are wrong, but some are useful." Even in those physics problems the straight line will ignore some issues (e.g. friction, air resistance, whatever). In other cases, there will be a lot of error in the model, and a better fit would be obtained with a more complex model.
A lot of the art and science of data analysis is figuring out how much complexity is "worth". Should we model a transformation? If so, just a quadratic? Or a spline? Perhaps a fractional polynomial. Maybe we need control variables. Moderators. Mediators. Etc.
Or maybe the straight line is enough.
In my view, this isn't a purely statistical question. We have to consider the context. Again, for me, this is what made being a statistical consultant fun.
As for how one tackles such problems, well, what I do is try to figure out what makes sense. Computers make this kind of playing around easy. But I try to be careful not to torture the data too much -- and there are ways to avoid that, too.
First, linear regression is only linear in its coefficients, not in the variables, so you absolutely can fit a quadratic like so
$$y_i = \beta_0+\beta_1x_{i}+\beta_2x^2_i + e_i$$
That we can approximate almost any (sufficiently smooth) function with a polynomial is a consequence of Taylor's theorem, and most often a linear or quadratic model is good enough for our purposes; they are sufficiently good approximations.
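To make this concrete, here is a minimal sketch (my own illustration, with made-up coefficients and noise): the quadratic model above is fit by ordinary least squares simply by putting $1$, $x$ and $x^2$ into the columns of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a noisy quadratic relationship (made-up coefficients).
x = np.linspace(-3, 3, 100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix with columns [1, x, x^2]: the model is nonlinear in x,
# but linear in the coefficients beta, so ordinary least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # close to the true coefficients [1.0, 2.0, -0.5]
```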
Take the case of economics, where linear regression sees ample usage. Here, most often, the question is one of distinguishing correlation from causation (rather than predicting with precision). In this context all we need to know is whether a change in X entails some significant change in y over a short enough interval that we can easily approximate the (ostensibly) continuous function with a straight line.
You're right: it's quite an assumption that the world is so simple that it can be modeled with lines, planes, and hyperplanes. But, the…
STONE-WEIERSTRASS THEOREM
…says that, technicalities aside, "decent" functions can be approximated arbitrarily well by polynomials. If you've gone far enough in linear algebra, you know that complicated polynomials like $wxz-x^7y^9-wz^2+9w^5x^3yz^8$ can be viewed as linear combinations of basis elements of a vector space. This gives a way to express that polynomial as a dot product of a vector of basis elements and a vector of weights. Across multiple data points, that becomes the familiar $X\beta$ from linear regression.
This is not limited to polynomials. Any linear combination (weighted sum/difference) of functions of the original data can be represented as a dot product. Fourier series can be represented this way to obtain periodicity in the regression fit. Splines can model curvature and can have advantages over polynomials in doing so. You can interact functions of single variables with something like $\sin(x_1)\cos(x_2)$.
Overall, that seemingly simple formulation of linear regression as $X\beta$ can model an enormous amount of complicated behavior.
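As an illustrative sketch of that point (my own example, not part of the answer above): sine and cosine columns in the design matrix let the same $X\beta$ least-squares machinery recover a periodic signal that no straight line could fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy periodic signal; a straight line would fit this very poorly.
x = np.linspace(0, 4 * np.pi, 200)
y = 2.0 * np.sin(x) - 1.0 * np.cos(2 * x) + rng.normal(scale=0.2, size=x.size)

# Fourier basis columns: the model is still a linear combination X @ beta,
# so this is still "linear regression" even though the fit is periodic.
X = np.column_stack([np.ones_like(x),
                     np.sin(x), np.cos(x),
                     np.sin(2 * x), np.cos(2 * x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta, 2))  # the weights on sin(x) and cos(2x) dominate
```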
Despite what they might tell you, it's not that "so many problems are linear"; it's that we settle for linear approximations when the nonlinear versions are too difficult. Linear approximations don't necessarily answer the exact question we want to ask, but we settle for them because the exact question is too difficult to answer directly.
For example, with your own example of linear regression, often the quantity you're actually interested in minimizing is the absolute deviation (L1 norm) rather than the squared deviations (L2 norm). But whereas squared deviations are trivial to minimize (just set the derivatives equal to zero and you get a system of linear equations, the normal equations, with a unique solution), absolute deviations don't give you an easy-to-solve equation when you differentiate them. So, traditionally, people have given up and settled for squared deviations.
Of course, nowadays, with computers and better algorithms, minimizing absolute deviations is easier than it used to be, but it's still hard to do, and depending on your particular problem you are still likely to settle for squared deviations.
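To illustrate the difference (a sketch of my own, using a generic SciPy optimizer for the L1 fit rather than a dedicated quantile-regression routine): the L2 fit has a closed form, while the L1 fit is obtained numerically, and the two can disagree noticeably when outliers are present.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Straight-line data with a few large outliers (made-up numbers).
x = np.linspace(0, 10, 50)
y = 3.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)
y[::10] += 5.0  # outliers pull the L2 fit more strongly than the L1 fit

X = np.column_stack([np.ones_like(x), x])

# L2: closed form via least squares.
beta_l2, *_ = np.linalg.lstsq(X, y, rcond=None)

# L1: no simple closed form, so minimize the sum of absolute deviations numerically.
def sum_abs_dev(beta):
    return np.abs(y - X @ beta).sum()

beta_l1 = minimize(sum_abs_dev, x0=beta_l2, method="Nelder-Mead").x

print("L2 fit:", np.round(beta_l2, 2))
print("L1 fit:", np.round(beta_l1, 2))  # less affected by the outliers
```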
TL;DR: There are several reasons that make ordinary least squares (OLS) popular, and that's why many problems are approached with that technique (it doesn't necessarily make the problems inherently linear).
Why can many problems be looked at as linear and "just about trying to find a solution for the Equation Ax = b"?
A lot of these problems are not 'linear' (depending on what you consider linear).
OLS regression, the equation $y = X \beta$, is only linear in the parameters $\beta$.
It can capture non-linear relationships between $y$ and some variable $x$ by having non-linear functions of $x$ in the columns of the matrix $X$.
See also:
Non-linear relationships can appear approximately linear or polynomial. A polynomial OLS regression might be applied to them, but that may not be the true underlying relationship. See Interpreting logistic modelling and linear modelling results for the same formula
Relationships can be linearized by transforming the variables such that the transformed problem is linear in the parameters. This linearisation is often performed to make computations easier; an example of such a transformation is sketched below.
With more computation power and a lot of data, linear methods become less necessary (that is what this question ponders: Why do we use Linear Models when tree based models often work better than linear models?). But simple linear or quadratic relationships still provide easily interpretable relationships.
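Here is a minimal sketch of the linearisation mentioned above (my own illustrative example with made-up parameters): an exponential relationship becomes linear in the parameters after a log transform of the response, so plain OLS applies.

```python
import numpy as np

rng = np.random.default_rng(3)

# Exponential decay y = a * exp(b * x) with multiplicative noise (made-up parameters).
x = np.linspace(0, 5, 80)
a_true, b_true = 4.0, -0.7
y = a_true * np.exp(b_true * x) * np.exp(rng.normal(scale=0.05, size=x.size))

# Taking logs linearises the problem: log(y) = log(a) + b * x,
# which is ordinary linear regression in the parameters (log(a), b).
X = np.column_stack([np.ones_like(x), x])
log_a_hat, b_hat = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

print(np.exp(log_a_hat), b_hat)  # roughly 4.0 and -0.7
```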
In the real world, most problems are not linear.
I'm an analytical chemist. At least in my field, linear problems are not rare at all. (This answer thus may be seen as an illustration of @PeterFlom's and @SextusEmpiricus' answers.)
In the past 100 or 150 years, massive amounts of work went into finding ways of getting measurements/signals that allow linear predictions of the characteristics we actually want to predict (often concentrations or physical values such as viscosity, density, etc.). (With an uncertainty that is sensible for the application at hand.)
(Often even in both meanings of the word linear: linear in the parameters ("statistician's linearity") but in addition also linear in the independent variable ("chemist's linearity") - we call these models bilinear in chemometrics.)
The linearity in the independent variable is often obtained by transformation - e.g. we fit pH $= -\log_{10}(\text{concentration})$ vs. voltage instead of the proton concentration itself, because for pH linearity is a sensible approximation for most applications. And if that approximation is not sufficiently good, we move on and use activity instead of concentration (all of this is undergrad physical chemistry). Suitable transformations are usually derived from physics and chemistry, but if need be, one can fall back to a generic Taylor expansion, as pointed out in some of the other answers.
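As a toy illustration of that transformation (my own sketch with synthetic data; the roughly 59 mV per pH unit Nernst slope at room temperature is assumed only for the simulation): the electrode voltage is wildly nonlinear in the proton concentration but close to linear in pH, so a straight-line calibration works after the $-\log_{10}$ transform.

```python
import numpy as np

rng = np.random.default_rng(4)

# Proton concentrations spanning several orders of magnitude (mol/L).
conc = 10.0 ** rng.uniform(-9, -2, size=40)
pH = -np.log10(conc)

# Toy Nernst-type electrode response: linear in pH, not in concentration.
# (~59.2 mV per pH unit at 25 deg C; the 400 mV offset is arbitrary.)
voltage = 400.0 - 59.2 * pH + rng.normal(scale=2.0, size=pH.size)

# Calibration: a straight-line fit of pH vs. voltage works well,
# whereas a straight-line fit of concentration vs. voltage would not.
X = np.column_stack([np.ones_like(voltage), voltage])
intercept, slope = np.linalg.lstsq(X, pH, rcond=None)[0]

print(intercept, slope)  # pH is recovered as intercept + slope * voltage
```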
In other words, in analytical chemistry (as well as in other similarly "measuring" fields) we use hardware and/or software approaches that yield data for statistical modelling/machine learning that are already cleaned of many of the known artefacts causing deviations from linear behaviour, and that are transformed by external knowledge (physics, chemistry) so as to yield linear behaviour (again, typically in both meanings - but at least in the statistical one).
And there are often good reasons to stick to this effort even though it is nowadays computationally feasible to use non-linear models.
E.g., in Raman spectroscopy, so-called shot noise following a Poisson distribution means that removing background light optically leads to a better signal-to-noise ratio than measuring the Raman signal on top of the background light and subtracting the background signal later on.
In addition, wet-lab science often has small sample sizes, restricting the model complexity we can afford. We just don't have sufficient real data available, or the data are very expensive to obtain. This is where transformations based on physics/chemistry that yield data known to work with low-complexity linear models really help a lot.
(Up to now, pre-trained deep models for "my" types of data are not available)