# Non-Linear Regression {#non-linear-regression}
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(kableExtra)
library(knitr)
library(nlstools)
library(investr)
library(MASS)
```
Non-linear regression models differ fundamentally from linear regression models in that the derivatives of the mean function with respect to parameters depend on one or more of the parameters. This dependence adds complexity but also provides greater flexibility to model intricate relationships.
**Linear Regression:**
- **Model Form Example:** A typical linear regression model looks like $y = \beta_0 + \beta_1 x$, where $\beta_0$ and $\beta_1$ are the parameters.
- **Parameter Effect:** The influence of each parameter on $y$ is constant. For example, if $\beta_1$ increases by 1, the change in $y$ is always $x$, regardless of the current value of $\beta_1$.
- **Derivatives:** The partial derivatives of $y$ with respect to each parameter (e.g., $\frac{\partial y}{\partial \beta_1} = x$) do not depend on the parameters $\beta_0$ or $\beta_1$ themselves---they only depend on the data $x$. This makes the mathematics of finding the best-fit line straightforward.
- Straightforward estimation via closed-form solutions like [Ordinary Least Squares].
**Non-linear Regression:**
- **Model Form Example:** Consider $y = \alpha \cdot e^{\beta x}$. Here, $\alpha$ and $\beta$ are parameters, but the relationship is not a straight line.
- **Parameter Effect:** The effect of changing $\alpha$ or $\beta$ on $y$ is not constant. For instance, if you change $\beta$, the impact on $y$ depends on both $x$ and the current value of $\beta$. This makes predictions and adjustments more complex.
- **Derivatives:** Taking the partial derivative with respect to $\beta$ gives $\frac{\partial y}{\partial \beta} = \alpha x e^{\beta x}$. Notice this derivative depends on $\alpha$, $\beta$, and $x$. Unlike linear regression, the sensitivity of $y$ to changes in $\beta$ changes as $\beta$ itself changes.
- Estimation requires iterative algorithms like the [Gauss-Newton Algorithm](#gauss-newton-algorithm-1), as closed-form solutions are not feasible.
| Feature | Linear Regression | Non-Linear Regression |
|--------------------|-----------------------|--------------------------|
| Relationship | Linear in parameters | Non-linear in parameters |
| Interpretability | High | Often challenging |
| Estimation | Closed-form solutions | Iterative algorithms |
| Computational Cost | Low | Higher |
: Summary Table: Linear vs. Non-Linear Regression
**Key Features of Non-linear regression:**
- **Complex Functional Forms**: Non-linear regression allows for relationships that are not straight lines or planes.
- **Interpretability Challenges**: Non-linear models can be difficult to interpret, especially if the functional forms are complex.
- **Practical Use Cases**:
- Growth curves
- High-order polynomials
- Linear approximations (e.g., Taylor expansions)
- Collections of locally linear models or basis functions (e.g., splines)
While these approaches can approximate data, they may suffer from interpretability issues or may not generalize well when data is sparse. Hence, intrinsically non-linear models are often preferred.
------------------------------------------------------------------------
**Intrinsically Non-Linear Models**
The general form of an intrinsically non-linear regression model is:
$$
Y_i = f(\mathbf{x}_i; \mathbf{\theta}) + \epsilon_i
$$
Where:
- $f(\mathbf{x}_i; \mathbf{\theta})$: A **non-linear function** that relates $E(Y_i)$ to the independent variables $\mathbf{x}_i$.
- $\mathbf{x}_i$: A $k \times 1$ vector of independent variables (fixed).
- $\mathbf{\theta}$: A $p \times 1$ vector of parameters.
- $\epsilon_i$: Independent and identically distributed random errors, often assumed to have a mean of 0 and a constant variance $\sigma^2$. In some cases, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
Example: Exponential Growth Model
A common non-linear model is the exponential growth function:
$$
y = \theta_1 e^{\theta_2 x} + \epsilon
$$
Where:
- $\theta_1$: Initial value.
- $\theta_2$: Growth rate.
- $x$: Independent variable (e.g., time).
- $\epsilon$: Random error.
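To make this concrete, here is a minimal R sketch that simulates data from the exponential growth model and fits it with `nls()`; the simulated values, noise level, and starting values are illustrative assumptions, not part of the example above.
```{r}
# Minimal sketch: simulate data from y = theta1 * exp(theta2 * x) + error
# and fit the model with nls() (values and starting points are illustrative)
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 * exp(0.3 * x) + rnorm(length(x), sd = 0.5)
# nls() needs starting values because the model is non-linear in the parameters
fit_exp <- nls(y ~ theta1 * exp(theta2 * x),
               start = list(theta1 = 1, theta2 = 0.1))
summary(fit_exp)
```
Because the model is non-linear in $\theta_1$ and $\theta_2$, `nls()` iterates from the supplied starting values toward the least squares solution rather than solving a closed-form system.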
## Inference
Since $Y_i = f(\mathbf{x}_i, \theta) + \epsilon_i$, where $\epsilon_i \sim \text{iid}(0, \sigma^2)$, we can estimate parameters ($\hat{\theta}$) by minimizing the sum of squared errors:
$$
\sum_{i=1}^{n} \big(Y_i - f(\mathbf{x}_i, \theta)\big)^2
$$
Let $\hat{\theta}$ be the minimizer, the variance of residuals is estimated as:
$$
s^2 = \hat{\sigma}^2_{\epsilon} = \frac{\sum_{i=1}^{n} \big(Y_i - f(\mathbf{x}_i, \hat{\theta})\big)^2}{n - p}
$$
where $p$ is the number of parameters in $\mathbf{\theta}$, and $n$ is the number of observations.
------------------------------------------------------------------------
**Asymptotic Distribution** of $\hat{\theta}$
Under regularity conditions---most notably that $\epsilon_i \sim N(0, \sigma^2)$ or that $n$ is sufficiently large for a central-limit-type argument---the parameter estimates $\hat{\theta}$ have the following asymptotic normal distribution:
$$
\hat{\theta} \sim AN(\mathbf{\theta}, \sigma^2[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1})
$$
where
- $AN$ stands for "asymptotic normality."
- $\mathbf{F}(\theta)$ is the $n \times p$ Jacobian matrix of partial derivatives of $f(\mathbf{x}_i, \boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$; for inference it is evaluated at the estimate $\hat{\theta}$. Specifically,
$$
\mathbf{F}(\theta) = \begin{pmatrix}
\frac{\partial f(\mathbf{x}_1, \boldsymbol{\theta})}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_1, \boldsymbol{\theta})}{\partial \theta_p} \\
\vdots & \ddots & \vdots \\
\frac{\partial f(\mathbf{x}_n, \boldsymbol{\theta})}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_n, \boldsymbol{\theta})}{\partial \theta_p}
\end{pmatrix}
$$
Asymptotic normality means that as the sample size $n$ becomes large, the sampling distribution of $\hat{\theta}$ approaches a normal distribution, which enables inference on the parameters.
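As a small illustration (assuming the same simulated exponential model as in the earlier sketch), the code below builds the Jacobian $\mathbf{F}(\hat{\theta})$ analytically from an `nls()` fit and forms the estimated asymptotic covariance $s^2[\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}$, which should agree closely with `vcov()`.
```{r}
# Sketch: Jacobian F(theta-hat) and asymptotic covariance from an nls() fit
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 * exp(0.3 * x) + rnorm(length(x), sd = 0.5)
fit <- nls(y ~ theta1 * exp(theta2 * x), start = list(theta1 = 1, theta2 = 0.1))
theta_hat <- coef(fit)
# Analytic Jacobian of f(x; theta) = theta1 * exp(theta2 * x):
# columns are df/dtheta1 = exp(theta2 * x) and df/dtheta2 = theta1 * x * exp(theta2 * x)
F_hat <- cbind(exp(theta_hat[2] * x),
               theta_hat[1] * x * exp(theta_hat[2] * x))
s2 <- sum(residuals(fit)^2) / df.residual(fit)   # s^2 = SSE / (n - p)
s2 * solve(t(F_hat) %*% F_hat)                   # estimated Var(theta-hat)
vcov(fit)                                        # should agree closely
```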
### Linear Functions of the Parameters
A "linear function of the parameters" refers to a quantity that can be written as $\mathbf{a}'\boldsymbol{\theta}$, where $\mathbf{a}$ is some (constant) contrast vector. Common examples include:
- A single parameter $\theta_j$ (using a vector $\mathbf{a}$ with 1 in the $j$-th position and 0 elsewhere).
- Differences, sums, or other contrasts, e.g. $\theta_1 - \theta_2$.
Suppose we are interested in a linear combination of the parameters, such as $\theta_1 - \theta_2$ in a three-parameter model. Define the contrast vector $\mathbf{a}$ as:
$$
\mathbf{a} = (1, -1, 0)'
$$
We then consider inference for $\mathbf{a}'\boldsymbol{\theta}$ (where $\mathbf{a}$ can be any constant $p$-dimensional vector). Using rules for the expectation and variance of a linear combination of a random vector $\mathbf{Z}$:
$$
\begin{aligned}
E(\mathbf{a'Z}) &= \mathbf{a'}E(\mathbf{Z}) \\
\text{Var}(\mathbf{a'Z}) &= \mathbf{a'} \text{Var}(\mathbf{Z}) \mathbf{a}
\end{aligned}
$$
We have
$$
\begin{aligned}
E(\mathbf{a'\hat{\theta}}) &= \mathbf{a'}E(\hat{\theta}) \approx \mathbf{a}' \theta \\
\text{Var}(\mathbf{a'} \hat{\theta}) &= \mathbf{a'} \text{Var}(\hat{\theta}) \mathbf{a} \approx \sigma^2 \mathbf{a'[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}a}
\end{aligned}
$$
Hence,
$$
\mathbf{a'\hat{\theta}} \sim AN\big(\mathbf{a'\theta}, \sigma^2 \mathbf{a'[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}a}\big)
$$
------------------------------------------------------------------------
Confidence Intervals for Linear Contrasts
Since $\mathbf{a}'\hat{\theta}$ is asymptotically independent of $s^2$ (up to order $O(1/n)$), a two-sided $100(1-\alpha)\%$ confidence interval for $\mathbf{a}'\boldsymbol{\theta}$ is given by:
$$
\mathbf{a}'\hat{\theta} \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{a}'[\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}\mathbf{a}}
$$
where
- $t_{(1-\alpha/2, n-p)}$ is the critical value of the $t$-distribution with $n - p$ degrees of freedom.
- $s = \sqrt{\hat{\sigma}^2_{\epsilon}}$ is the estimated standard deviation of the residuals.
**Special Case**: A Single Parameter $\theta_j$
If we focus on a single parameter $\theta_j$, let $\mathbf{a'} = (0, \dots, 1, \dots, 0)$ (with 1 at the $j$-th position). Then, the confidence interval for $\theta_j$ becomes:
$$
\hat{\theta}_j \pm t_{(1-\alpha/2, n-p)} s \sqrt{\hat{c}^j}
$$
where $\hat{c}^j$ is the $j$-th diagonal element of $[\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}$.
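A minimal sketch of these Wald-type intervals, assuming the simulated two-parameter exponential model from the earlier sketches (the contrast vector below is an illustrative choice):
```{r}
# Sketch: Wald-type confidence interval for a'theta from an nls() fit
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 * exp(0.3 * x) + rnorm(length(x), sd = 0.5)
fit <- nls(y ~ theta1 * exp(theta2 * x), start = list(theta1 = 1, theta2 = 0.1))
theta_hat <- coef(fit)
V <- vcov(fit)                               # s^2 [F(theta-hat)' F(theta-hat)]^{-1}
t_crit <- qt(0.975, df = df.residual(fit))
a <- c(1, 0)                                 # picks out theta1
est <- drop(t(a) %*% theta_hat)
se  <- drop(sqrt(t(a) %*% V %*% a))
c(lower = est - t_crit * se, upper = est + t_crit * se)
confint.default(fit)                         # Wald intervals for each parameter
```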
------------------------------------------------------------------------
### Nonlinear Functions of Parameters
In many cases, we are interested in nonlinear functions of $\boldsymbol{\theta}$. Let $h(\boldsymbol{\theta})$ be such a function (e.g., a ratio of parameters, a difference of exponentials, etc.).
When $h(\theta)$ is a nonlinear function of the parameters, we can use a **Taylor series expansion** about $\theta$ to approximate $h(\hat{\theta})$:
$$
h(\hat{\theta}) \approx h(\theta) + \mathbf{h}' [\hat{\theta} - \theta]
$$
where
- $\mathbf{h} = \left( \frac{\partial h}{\partial \theta_1}, \frac{\partial h}{\partial \theta_2}, \dots, \frac{\partial h}{\partial \theta_p} \right)'$ is the gradient vector of partial derivatives.
**Key Approximations:**
1. **Expectation and Variance of** $\hat{\theta}$ (using the asymptotic normality of $\hat{\theta}$): $$
\begin{aligned}
E(\hat{\theta}) &\approx \theta, \\
\text{Var}(\hat{\theta}) &\approx \sigma^2 [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1}.
\end{aligned}
$$
2. **Expectation and Variance of** $h(\hat{\theta})$ (properties of expectation and variance of (approximately) linear transformations): $$
\begin{aligned}
E(h(\hat{\theta})) &\approx h(\theta), \\
\text{Var}(h(\hat{\theta})) &\approx \sigma^2 \mathbf{h}'[\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}.
\end{aligned}$$
Combining these results, we find:
$$
h(\hat{\theta}) \sim AN(h(\theta), \sigma^2 \mathbf{h}' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}),
$$
where $AN$ represents asymptotic normality.
**Confidence Interval for** $h(\theta)$**:**
An approximate $100(1-\alpha)\%$ confidence interval for $h(\theta)$ is:
$$
h(\hat{\theta}) \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{h}'[\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}},
$$
where $\mathbf{h}$ and $\mathbf{F}(\theta)$ are evaluated at $\hat{\theta}$.
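As a sketch of the delta method in practice, suppose (purely for illustration) that $h(\boldsymbol{\theta}) = \theta_1/\theta_2$ in the simulated exponential model; the gradient $\mathbf{h}$ below is computed analytically.
```{r}
# Sketch: delta-method interval for the nonlinear function h(theta) = theta1 / theta2
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 * exp(0.3 * x) + rnorm(length(x), sd = 0.5)
fit <- nls(y ~ theta1 * exp(theta2 * x), start = list(theta1 = 1, theta2 = 0.1))
theta_hat <- coef(fit)
V <- vcov(fit)
h_hat <- unname(theta_hat[1] / theta_hat[2])
# Gradient of h(theta) = theta1 / theta2, evaluated at theta-hat
h_grad <- c(1 / theta_hat[2], -theta_hat[1] / theta_hat[2]^2)
se_h   <- drop(sqrt(t(h_grad) %*% V %*% h_grad))
t_crit <- qt(0.975, df = df.residual(fit))
c(estimate = h_hat, lower = h_hat - t_crit * se_h, upper = h_hat + t_crit * se_h)
```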
------------------------------------------------------------------------
To compute a **prediction interval** for a new observation $Y_0$ at $x = x_0$:
1. **Model Definition**: $$
Y_0 = f(x_0; \theta) + \epsilon_0, \quad \epsilon_0 \sim N(0, \sigma^2),
$$ with the predicted value: $$
\hat{Y}_0 = f(x_0, \hat{\theta}).
$$
2. **Approximation for** $\hat{Y}_0$: As $n \to \infty$, $\hat{\theta} \to \theta$, so we have: $$
f(x_0, \hat{\theta}) \approx f(x_0, \theta) + \mathbf{f}_0(\theta)' [\hat{\theta} - \theta],
$$ where: $$
\mathbf{f}_0(\theta) = \left( \frac{\partial f(x_0, \theta)}{\partial \theta_1}, \dots, \frac{\partial f(x_0, \theta)}{\partial \theta_p} \right)'.
$$
3. **Error Approximation**: $$
\begin{aligned}Y_0 - \hat{Y}_0 &\approx Y_0 - f(x_0,\theta) - \mathbf{f}_0(\theta)'[\hat{\theta}-\theta] \\&= \epsilon_0 - \mathbf{f}_0(\theta)'[\hat{\theta}-\theta]\end{aligned}
$$
4. **Variance of** $Y_0 - \hat{Y}_0$: $$
\begin{aligned}
\text{Var}(Y_0 - \hat{Y}_0) &\approx \text{Var}(\epsilon_0 - \mathbf{f}_0(\theta)' [\hat{\theta} - \theta]) \\
&= \sigma^2 + \sigma^2 \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta) \\
&= \sigma^2 \big(1 + \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta)\big).
\end{aligned}
$$
Hence, the prediction error $Y_0 - \hat{Y}_0$ follows an asymptotic normal distribution:
$$
Y_0 - \hat{Y}_0 \sim AN\big(0, \sigma^2 \big(1 + \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta)\big)\big).
$$
A $100(1-\alpha)\%$ prediction interval for $Y_0$ is:
$$
\hat{Y}_0 \pm t_{(1-\alpha/2, n-p)} s \sqrt{1 + \mathbf{f}_0(\hat{\theta})' [\mathbf{F}(\hat{\theta})' \mathbf{F}(\hat{\theta})]^{-1} \mathbf{f}_0(\hat{\theta})}.
$$
where we substitute $\hat{\theta}$ into $\mathbf{f}_0$ and $\mathbf{F}$. Recall that $s$ is the estimate of $\sigma$.
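The sketch below computes this prediction interval by hand at a single new point $x_0$ for the simulated exponential model; the choice of $x_0$, the data, and the analytic gradients are illustrative assumptions.
```{r}
# Sketch: hand-computed prediction interval for a new observation at x0 = 5
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 * exp(0.3 * x) + rnorm(length(x), sd = 0.5)
fit <- nls(y ~ theta1 * exp(theta2 * x), start = list(theta1 = 1, theta2 = 0.1))
theta_hat <- coef(fit)
x0 <- 5
y0_hat <- unname(theta_hat[1] * exp(theta_hat[2] * x0))      # f(x0, theta-hat)
# f0(theta-hat): gradient of f(x0, theta) w.r.t. (theta1, theta2)
f0 <- c(exp(theta_hat[2] * x0),
        theta_hat[1] * x0 * exp(theta_hat[2] * x0))
# F(theta-hat): Jacobian over the observed x values
F_hat <- cbind(exp(theta_hat[2] * x),
               theta_hat[1] * x * exp(theta_hat[2] * x))
s2 <- sum(residuals(fit)^2) / df.residual(fit)
var_pred <- s2 * (1 + drop(t(f0) %*% solve(t(F_hat) %*% F_hat) %*% f0))
t_crit <- qt(0.975, df = df.residual(fit))
c(fit = y0_hat,
  lower = y0_hat - t_crit * sqrt(var_pred),
  upper = y0_hat + t_crit * sqrt(var_pred))
```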
------------------------------------------------------------------------
Sometimes we want a confidence interval for $E(Y_i)$ (i.e., the mean response at $x_0$), rather than a prediction interval for an individual future observation. In that case, the variance term for the random error $\epsilon_0$ is not included. Hence, the formula is the same but without the "+1":
$$
E(Y_0) \approx f(x_0; \theta),
$$
and the confidence interval is:
$$
f(x_0, \hat{\theta}) \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{f}_0(\hat{\theta})' [\mathbf{F}(\hat{\theta})' \mathbf{F}(\hat{\theta})]^{-1} \mathbf{f}_0(\hat{\theta})}.
$$
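In practice, the `investr` package (loaded in the setup chunk) can produce both confidence and prediction intervals for an `nls` fit via `predFit()`; the call below is a sketch on simulated data, and the arguments should be checked against the package documentation.
```{r}
# Sketch: the same two kinds of interval via investr::predFit() on simulated data
set.seed(1)
dat <- data.frame(x = seq(0, 10, length.out = 50))
dat$y <- 2 * exp(0.3 * dat$x) + rnorm(nrow(dat), sd = 0.5)
fit <- nls(y ~ theta1 * exp(theta2 * x), data = dat,
           start = list(theta1 = 1, theta2 = 0.1))
new_x <- data.frame(x = c(2, 5, 8))
investr::predFit(fit, newdata = new_x, interval = "confidence", level = 0.95)  # mean response
investr::predFit(fit, newdata = new_x, interval = "prediction", level = 0.95)  # new observation
```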
Summary
- [Linear Functions of the Parameters]: A function $f(x, \theta)$ is linear in $\theta$ if it can be written in the form $$f(x, \theta) = \theta_1 g_1(x) + \theta_2 g_2(x) + \dots + \theta_p g_p(x)$$ where $g_j(x)$ do not depend on $\theta$. In this case, the Jacobian $\mathbf{F}(\theta)$ does not depend on $\theta$ itself (only on $x_i$), and exact formulas often match what is familiar from linear regression.
- [Nonlinear Functions of Parameters]: If $f(x, \theta)$ depends on $\theta$ in a nonlinear way (e.g., $\theta_1 e^{\theta_2 x}, \beta_1/\beta_2$ or more complicated expressions), $\mathbf{F}(\theta)$ depends on $\theta$. Estimation generally requires iterative numerical methods (e.g., Gauss--Newton, Levenberg--Marquardt), and the asymptotic results rely on evaluating partial derivatives at $\hat{\theta}$. Nevertheless, the final inference formulas---confidence intervals, prediction intervals---have a similar form, thanks to the asymptotic normality arguments.
------------------------------------------------------------------------
## Non-linear Least Squares Estimation
The least squares (LS) estimate of $\theta$, denoted as $\hat{\theta}$, minimizes the residual sum of squares:
$$
S(\hat{\theta}) = SSE(\hat{\theta}) = \sum_{i=1}^{n} \{Y_i - f(\mathbf{x}_i; \hat{\theta})\}^2
$$
To solve this, we consider the partial derivatives of $S(\theta)$ with respect to each $\theta_j$ and set them to zero, leading to the normal equations:
$$
\frac{\partial S(\theta)}{\partial \theta_j} = -2 \sum_{i=1}^{n} \{Y_i - f(\mathbf{x}_i; \theta)\} \frac{\partial f(\mathbf{x}_i; \theta)}{\partial \theta_j} = 0
$$
However, these equations are inherently non-linear and, in most cases, cannot be solved analytically. As a result, various estimation techniques are employed to approximate solutions efficiently. These approaches include:
- [Iterative Optimization](#iterative-optimization-nonlinear-regression) -- Methods that refine estimates through successive iterations to minimize error.
- [Derivative-Free] Methods -- Techniques that do not rely on gradient information, useful for complex or non-smooth functions.
- [Stochastic Heuristic](#stochastic-heuristic-nolinear-regression) -- Algorithms that incorporate randomness, such as genetic algorithms or simulated annealing, to explore solution spaces.
- [Linearization](#linearization-nonlinear-regression-optimization)-- Approximating non-linear models with linear ones to enable analytical or numerical solutions.
- [Hybrid](#hybrid-nonlinear-regression-optimization) Approaches -- Combining multiple methods to leverage their respective strengths for improved estimation.
| **Category** | **Method** | **Best For** | **Derivative?** | **Optimization** | **Comp. Cost** |
|:---------------------|:----------|:---------|:---------|:---------|:---------|
| [Iterative Optimization](#iterative-optimization-nonlinear-regression) | [Steepest Descent (Gradient Descent)](#steepest-descent) | Simple problems, slow convergence | Yes | Local | Low |
| **Iterative Optimization** | [Gauss-Newton Algorithm] | Faster than GD, ignores exact second-order info | Yes | Local | Medium |
| **Iterative Optimization** | [Levenberg-Marquardt Algorithm](#levenberg-marquardt) | Balances GN & GD, robust | Yes | Local | Medium |
| **Iterative Optimization** | [Newton-Raphson Method](#newton-raphson) | Quadratic convergence, needs Hessian | Yes | Local | High |
| **Iterative Optimization** | [Quasi-Newton Method] | Approximates Hessian for large problems | Yes | Local | Medium |
| **Iterative Optimization** | [Trust-Region Reflective Algorithm] | Handles constraints, robust | Yes | Local | High |
| [Derivative-Free] | [Secant Method](#secant-method) | Approximates derivative from function evaluations | No | Local | Medium |
| **Derivative-Free** | [Nelder-Mead (Simplex)](#nelder-mead) | No derivatives, heuristic | No | Local | Medium |
| **Derivative-Free** | [Powell's Method](#powells-method) | Line search, no explicit gradient | No | Local | Medium |
| **Derivative-Free** | [Grid Search](#grid-search) | Exhaustive search (best in low dims) | No | Global | Very High |
| **Derivative-Free** | [Hooke-Jeeves Pattern Search](#hooke-jeeves) | Pattern-based, black-box optimization | No | Local | Medium |
| **Derivative-Free** | [Bisection Method](#bisection-method) | Simple root/interval-based approach | No | Local | Low |
| [Stochastic Heuristic](#stochastic-heuristic-nolinear-regression) | [Simulated Annealing](#simulated-annealing) | Escapes local minima, non-smooth problems | No | Global | High |
| **Stochastic Heuristic** | [Genetic Algorithm](#genetic-algorithm) | Large search spaces, evolving parameters | No | Global | High |
| **Stochastic Heuristic** | [Particle Swarm Optimization](#particle-swarm-optimization) | Swarm-based, often fast global convergence | No | Global | Medium |
| **Stochastic Heuristic** | [Evolutionary Strategies](#evolutionary-strategies) | High-dimensional, adaptive step-size | No | Global | High |
| **Stochastic Heuristic** | [Differential Evolution Algorithm](#differential-evolution-algorithm) | Robust global optimizer, population-based | No | Global | High |
| **Stochastic Heuristic** | Ant Colony Optimization (ACO) | Discrete / combinatorial problems | No | Global | High |
| [Linearization](#linearization-nonlinear-regression-optimization) | [Taylor Series Approximation](#taylor-series-approximation-nonlinear-optimization) | Local approximation of non-linearity | Yes | Local | Low |
| **Linearization** | [Log-Linearization](#log-linearization-nonlinear-optimization) | Transforms non-linear equations | Yes | Local | Low |
| [Hybrid](#hybrid-nonlinear-regression-optimization) | [Adaptive Levenberg-Marquardt](#adaptive-levenberg-marquardt) | Dynamically adjusts damping in LM | Yes | Local | Medium |
| **Hybrid** | Hybrid Genetic Algorithm & LM (GA-LM) | GA for coarse search, LM for fine-tuning | No | Hybrid | High |
| **Hybrid** | Neural Network-Based NLLS | Deep learning for complex non-linear least squares | No | Hybrid | Very High |
### Iterative Optimization {#iterative-optimization-nonlinear-regression}
#### Gauss-Newton Algorithm
The Gauss-Newton Algorithm is an iterative optimization method used to estimate parameters in nonlinear least squares problems. It refines parameter estimates by approximating the Hessian matrix using first-order derivatives, making it computationally efficient for many practical applications (e.g., regression models in finance and marketing analytics). The objective is to minimize the Sum of Squared Errors (SSE):
$$
SSE(\theta) = \sum_{i=1}^{n} [Y_i - f_i(\theta)]^2,
$$
where $\mathbf{Y} = [Y_1, \dots, Y_n]'$ are the observed responses, and $f_i(\theta)$ are the model-predicted values.
------------------------------------------------------------------------
**Iterative Refinement via Taylor Expansion**
The Gauss-Newton algorithm iteratively refines an initial estimate $\hat{\theta}^{(0)}$ using the **Taylor series expansion** of $f(\mathbf{x}_i; \theta)$ about $\hat{\theta}^{(0)}$. We start with the observation model:
$$
Y_i = f(\mathbf{x}_i; \theta) + \epsilon_i.
$$
By expanding $f(\mathbf{x}_i; \theta)$ around $\hat{\theta}^{(0)}$ and ignoring higher-order terms (assuming the remainder is small), we get:
$$
\begin{aligned}
Y_i &\approx f(\mathbf{x}_i; \hat{\theta}^{(0)})
+ \sum_{j=1}^{p} \frac{\partial f(\mathbf{x}_i; \theta)}{\partial \theta_j} \bigg|_{\theta = \hat{\theta}^{(0)}}
\bigl(\theta_j - \hat{\theta}_j^{(0)}\bigr)
+ \epsilon_i.
\end{aligned}
$$
In **matrix form**, let
$$
\mathbf{Y} =
\begin{bmatrix}
Y_1 \\ \vdots \\ Y_n
\end{bmatrix},
\quad
\mathbf{f}(\hat{\theta}^{(0)}) =
\begin{bmatrix}
f(\mathbf{x}_1, \hat{\theta}^{(0)}) \\ \vdots \\ f(\mathbf{x}_n, \hat{\theta}^{(0)})
\end{bmatrix},
$$
and define the **Jacobian matrix** of partial derivatives
$$
\mathbf{F}(\hat{\theta}^{(0)}) =
\begin{bmatrix}
\frac{\partial f(\mathbf{x}_1, \theta)}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_1, \theta)}{\partial \theta_p} \\
\vdots & \ddots & \vdots \\
\frac{\partial f(\mathbf{x}_n, \theta)}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_n, \theta)}{\partial \theta_p}
\end{bmatrix}_{\theta = \hat{\theta}^{(0)}}.
$$
Then,
$$
\mathbf{Y} \approx \mathbf{f}(\hat{\theta}^{(0)})
+ \mathbf{F}(\hat{\theta}^{(0)})\,(\theta - \hat{\theta}^{(0)}) + \epsilon,
$$
with $\epsilon = [\epsilon_1, \dots, \epsilon_n]'$ assumed i.i.d. with mean $0$ and variance $\sigma^2$.
From this linear approximation,
$$
\mathbf{Y} - \mathbf{f}(\hat{\theta}^{(0)})
\approx \mathbf{F}(\hat{\theta}^{(0)})\,(\theta - \hat{\theta}^{(0)}).
$$
Solving for $\theta - \hat{\theta}^{(0)}$ in the **least squares** sense gives the Gauss-Newton increment $\hat{\delta}^{(1)}$, so we can update:
$$
\hat{\theta}^{(1)} = \hat{\theta}^{(0)} + \hat{\delta}^{(1)}.
$$
------------------------------------------------------------------------
**Step-by-Step Procedure**
1. **Initialize**: Start with an initial estimate $\hat{\theta}^{(0)}$ and set $j = 0$.\
2. **Compute Taylor Expansion**: Calculate $\mathbf{f}(\hat{\theta}^{(j)})$ and $\mathbf{F}(\hat{\theta}^{(j)})$.\
3. **Solve for Increment**: Treating $\mathbf{Y} - \mathbf{f}(\hat{\theta}^{(j)}) \approx \mathbf{F}(\hat{\theta}^{(j)})\, (\theta - \hat{\theta}^{(j)})$ as a linear model, use Ordinary Least Squares to compute $\hat{\delta}^{(j+1)}$.\
4. **Update Parameters**: Set $\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} + \hat{\delta}^{(j+1)}$ (Steps 2-4 are illustrated in the sketch after this list).\
5. **Check for Convergence**: If the convergence criteria are met (see below), stop; otherwise, return to Step 2.\
6. **Estimate Variance**: After convergence, we assume $\epsilon \sim (\mathbf{0}, \sigma^2 \mathbf{I})$. The variance $\sigma^2$ can be estimated by
$$
\hat{\sigma}^2 = \frac{1}{n-p} \bigl(\mathbf{Y} - \mathbf{f}(\mathbf{x}; \hat{\theta})\bigr)' \bigl(\mathbf{Y} - \mathbf{f}(\mathbf{x}; \hat{\theta})\bigr).
$$
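The following minimal sketch carries out a single Gauss-Newton increment (Steps 2-4) for the exponential model with an analytic Jacobian; the data and starting values are illustrative, and a full iterative implementation with step halving appears in the Modified Gauss-Newton subsection below.
```{r}
# Sketch: a single Gauss-Newton increment for y = theta1 * exp(theta2 * x) + error
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 * exp(0.3 * x) + rnorm(length(x), sd = 0.5)
theta <- c(1, 0.1)                                 # initial estimate theta^(0)
f_hat <- theta[1] * exp(theta[2] * x)              # f(x; theta^(0))
# Jacobian F(theta^(0)): columns are df/dtheta1 and df/dtheta2
F_mat <- cbind(exp(theta[2] * x),
               theta[1] * x * exp(theta[2] * x))
# Gauss-Newton increment: OLS of the residuals on the Jacobian (Step 3)
delta <- solve(t(F_mat) %*% F_mat, t(F_mat) %*% (y - f_hat))
theta + drop(delta)                                # updated estimate theta^(1) (Step 4)
```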
------------------------------------------------------------------------
**Convergence Criteria**
Common criteria for deciding when to stop iterating include:
1. **Objective Function Change**: $$
\frac{\bigl|SSE(\hat{\theta}^{(j+1)}) - SSE(\hat{\theta}^{(j)})\bigr|}{SSE(\hat{\theta}^{(j)})} < \gamma_1.
$$
2. **Parameter Change**: $$
\bigl|\hat{\theta}^{(j+1)} - \hat{\theta}^{(j)}\bigr| < \gamma_2.
$$
3. **Residual Projection Criterion**: The residuals satisfy convergence as defined in [@bates1981relative].
------------------------------------------------------------------------
Another way to see the update step is by viewing the **necessary condition for a minimum**: the gradient of $SSE(\theta)$ with respect to $\theta$ should be zero. For
$$
SSE(\theta) = \sum_{i=1}^{n} [Y_i - f_i(\theta)]^2,
$$
the gradient is
$$
\frac{\partial SSE(\theta)}{\partial \theta}
= -2\,\mathbf{F}(\theta)' \bigl[\mathbf{Y} - \mathbf{f}(\theta)\bigr].
$$
Using the Gauss-Newton update rule from iteration $j$ to $j+1$:
$$
\begin{aligned}
\hat{\theta}^{(j+1)}
&= \hat{\theta}^{(j)} + \hat{\delta}^{(j+1)} \\
&= \hat{\theta}^{(j)} + \bigl[\mathbf{F}(\hat{\theta}^{(j)})' \,\mathbf{F}(\hat{\theta}^{(j)})\bigr]^{-1}
\,\mathbf{F}(\hat{\theta}^{(j)})' \bigl[\mathbf{Y} - \mathbf{f}(\hat{\theta}^{(j)})\bigr] \\
&= \hat{\theta}^{(j)}
- \frac{1}{2} \bigl[\mathbf{F}(\hat{\theta}^{(j)})' \,\mathbf{F}(\hat{\theta}^{(j)})\bigr]^{-1}
\, \frac{\partial SSE(\hat{\theta}^{(j)})}{\partial \theta},
\end{aligned}
$$
where:
- $\frac{\partial SSE(\hat{\theta}^{(j)})}{\partial \theta}$ is the **gradient vector**, pointing in the direction of **steepest ascent** of SSE.\
- $\bigl[\mathbf{F}(\hat{\theta}^{(j)})'\mathbf{F}(\hat{\theta}^{(j)})\bigr]^{-1}$ determines the **step size**, controlling how far to move in the direction of improvement.\
- The factor $-\tfrac{1}{2}$ ensures movement in the **direction of steepest descent**, helping to **minimize** the SSE.
The Gauss-Newton method works well when the nonlinear model can be approximated accurately by a **first-order Taylor expansion** near the solution. If the assumption of near-linearity in the residual function $\mathbf{r}(\theta) = \mathbf{Y} - \mathbf{f}(\theta)$ is violated, convergence may be slow or fail altogether. In such cases, more robust methods like [Levenberg-Marquardt Algorithm](#levenberg-marquardt) (which modifies Gauss-Newton with a damping parameter) are often preferred.
------------------------------------------------------------------------
```{r}
# Load necessary libraries
library(minpack.lm) # Provides nonlinear least squares functions
# Define a nonlinear function (exponential model)
nonlinear_model <- function(theta, x) {
# theta is a vector of parameters: theta[1] = A, theta[2] = B
# x is the independent variable
# The model is A * exp(B * x)
theta[1] * exp(theta[2] * x)
}
# Define SSE function for clarity
sse <- function(theta, x, y) {
# SSE = sum of squared errors between actual y and model predictions
sum((y - nonlinear_model(theta, x)) ^ 2)
}
# Generate synthetic data
set.seed(123) # for reproducibility
x <- seq(0, 10, length.out = 100) # 100 points from 0 to 10
true_theta <- c(2, 0.3) # true parameter values
y <- nonlinear_model(true_theta, x) + rnorm(length(x), sd = 0.5)
# Display the first few data points
head(data.frame(x, y))
# Gauss-Newton optimization using nls.lm (Levenberg-Marquardt as extension).
# Initial guess for theta: c(1, 0.1)
fit <- nls.lm(
par = c(1, 0.1),
fn = function(theta)
y - nonlinear_model(theta, x)
)
# Display estimated parameters
cat("Estimated parameters (A, B):\n")
print(fit$par)
```
1. We defined the model `nonlinear_model(theta, x)`, which returns `A * exp(B * x)`.
2. We generated synthetic data using the `true_theta` values and added random noise.
3. We used `nls.lm(...)` from the `minpack.lm` package to fit the data:
- `par = c(1, 0.1)` is our initial parameter guess.
- `fn = function(theta) y - nonlinear_model(theta, x)` is the residual function, i.e., observed minus predicted.
4. The `fit$par` provides the estimated parameters after the algorithm converges.
```{r}
# Visualize the data and the fitted model
plot(
x,
y,
main = "Data and Fitted Curve (Gauss-Newton/Levenberg-Marquardt)",
xlab = "x",
ylab = "y",
pch = 19,
cex = 0.5
)
curve(
nonlinear_model(fit$par, x),
from = 0,
to = 10,
add = TRUE,
col = "red",
lwd = 2
)
legend(
"topleft",
legend = c("Data", "Fitted Curve"),
pch = c(19, NA),
lty = c(NA, 1),
col = c("black", "red")
)
```
#### Modified Gauss-Newton Algorithm {#modified-gauss-newton-algorithm}
The Modified Gauss-Newton Algorithm introduces a learning rate $\alpha_j$ to control step size and prevent overshooting the local minimum. The standard [Gauss-Newton Algorithm] assumes that the full step direction $\hat{\delta}^{(j+1)}$ is optimal, but in practice, especially for highly nonlinear problems, it can overstep the minimum or cause numerical instability. The modification introduces a step size reduction, making it more robust.
We redefine the update step as:
$$
\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} + \alpha_j \hat{\delta}^{(j+1)}, \quad 0 < \alpha_j < 1,
$$
where:
- $\alpha_j$ is a **learning rate**, controlling how much of the step $\hat{\delta}^{(j+1)}$ is taken.
- If $\alpha_j = 1$, we recover the standard Gauss-Newton method.
- If $\alpha_j$ is too small, convergence is slow; if too large, the algorithm may diverge.
This learning rate $\alpha_j$ allows for **adaptive step size adjustments**, helping prevent excessive parameter jumps and ensuring that SSE decreases at each iteration.
------------------------------------------------------------------------
A common approach to determine $\alpha_j$ is **step halving**, ensuring that each iteration moves in a direction that reduces SSE. Instead of using a fixed $\alpha_j$, we iteratively reduce the step size until SSE decreases:
$$
\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} + \frac{1}{2^k}\hat{\delta}^{(j+1)},
$$
where:
- $k$ is the smallest non-negative integer such that
$$
SSE(\hat{\theta}^{(j)} + \frac{1}{2^k} \hat{\delta}^{(j+1)}) < SSE(\hat{\theta}^{(j)}).
$$
This means we start with the full step $\hat{\delta}^{(j+1)}$, then try $\hat{\delta}^{(j+1)}/2$, then $\hat{\delta}^{(j+1)}/4$, and so on, until SSE is reduced.
**Algorithm for Step Halving:**
1. Compute the Gauss-Newton step $\hat{\delta}^{(j+1)}$.
2. Set an initial $\alpha_j = 1$.
3. If the updated parameters $\hat{\theta}^{(j)} + \alpha_j \hat{\delta}^{(j+1)}$ increase SSE, divide $\alpha_j$ by 2.
4. Repeat until SSE decreases.
This **ensures monotonic SSE reduction**, preventing divergence due to an overly aggressive step.
------------------------------------------------------------------------
**Generalized Form of the Modified Algorithm**
A more general form of the update rule, incorporating step size control and a matrix $\mathbf{A}_j$, is:
$$
\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} - \alpha_j \mathbf{A}_j \frac{\partial Q(\hat{\theta}^{(j)})}{\partial \theta},
$$
where:
- $\mathbf{A}_j$ is a **positive definite matrix** that preconditions the update direction.
- $\alpha_j$ is the **learning rate**.
- $\frac{\partial Q(\hat{\theta}^{(j)})}{\partial \theta}$ is the **gradient** of the objective function $Q(\theta)$, typically SSE in nonlinear regression.
------------------------------------------------------------------------
**Connection to the Modified Gauss-Newton Algorithm**
The **Modified Gauss-Newton Algorithm** fits this framework:
$$
\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} - \alpha_j [\mathbf{F}(\hat{\theta}^{(j)})' \mathbf{F}(\hat{\theta}^{(j)})]^{-1} \frac{\partial SSE(\hat{\theta}^{(j)})}{\partial \theta}.
$$
Here, we recognize:
- **Objective function**: $Q = SSE$.
- **Preconditioner matrix**: $[\mathbf{F}(\hat{\theta}^{(j)})' \mathbf{F}(\hat{\theta}^{(j)})]^{-1} = \mathbf{A}$.
Thus, the standard Gauss-Newton method can be interpreted as a special case of this broader optimization framework, with a preconditioned gradient descent approach.
------------------------------------------------------------------------
```{r}
# Load required library
library(minpack.lm)
# Define a nonlinear function (exponential model)
nonlinear_model <- function(theta, x) {
theta[1] * exp(theta[2] * x)
}
# Define the Sum of Squared Errors function
sse <- function(theta, x, y) {
sum((y - nonlinear_model(theta, x)) ^ 2)
}
# Generate synthetic data
set.seed(123)
x <- seq(0, 10, length.out = 100)
true_theta <- c(2, 0.3)
y <- nonlinear_model(true_theta, x) + rnorm(length(x), sd = 0.5)
# Gauss-Newton with Step Halving
gauss_newton_modified <-
function(theta_init,
x,
y,
tol = 1e-6,
max_iter = 100) {
theta <- theta_init
for (j in 1:max_iter) {
# Compute Jacobian matrix numerically
epsilon <- 1e-6
F_matrix <-
matrix(0, nrow = length(y), ncol = length(theta))
for (p in 1:length(theta)) {
theta_perturb <- theta
theta_perturb[p] <- theta[p] + epsilon
F_matrix[, p] <-
(nonlinear_model(theta_perturb, x)-nonlinear_model(theta, x))/epsilon
}
# Compute residuals
residuals <- y - nonlinear_model(theta, x)
# Compute Gauss-Newton step
delta <-
solve(t(F_matrix) %*% F_matrix) %*% t(F_matrix) %*% residuals
# Step Halving Implementation
alpha <- 1
k <- 0
while (sse(theta + alpha * delta, x, y) >= sse(theta, x, y) &&
k < 10) {
alpha <- alpha / 2
k <- k + 1
}
# Update theta
theta_new <- theta + alpha * delta
# Check for convergence
if (sum(abs(theta_new - theta)) < tol) {
break
}
theta <- theta_new
}
return(theta)
}
# Run Modified Gauss-Newton Algorithm
theta_init <- c(1, 0.1) # Initial parameter guess
estimated_theta <- gauss_newton_modified(theta_init, x, y)
# Display estimated parameters
cat("Estimated parameters (A, B) with Modified Gauss-Newton:\n")
print(estimated_theta)
# Plot data and fitted curve
plot(
x,
y,
main = "Modified Gauss-Newton: Data & Fitted Curve",
pch = 19,
cex = 0.5,
xlab = "x",
ylab = "y"
)
curve(
nonlinear_model(estimated_theta, x),
from = 0,
to = 10,
add = TRUE,
col = "red",
lwd = 2
)
legend(
"topleft",
legend = c("Data", "Fitted Curve"),
pch = c(19, NA),
lty = c(NA, 1),
col = c("black", "red")
)
```
#### Steepest Descent (Gradient Descent) {#steepest-descent}
The Steepest Descent Method, commonly known as Gradient Descent, is a fundamental iterative optimization technique used for finding parameter estimates that minimize an objective function $\mathbf{Q}(\theta)$. It is a special case of the [Modified Gauss-Newton Algorithm](#modified-gauss-newton-algorithm), where the preconditioning matrix $\mathbf{A}_j$ is replaced by the identity matrix.
The update rule is given by:
$$
\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} - \alpha_j \mathbf{I}_{p \times p}\frac{\partial \mathbf{Q}(\hat{\theta}^{(j)})}{\partial \theta},
$$
where:
- $\alpha_j$ is the **learning rate**, determining the step size.
- $\mathbf{I}_{p \times p}$ is the **identity matrix**, meaning updates occur in the direction of the **negative gradient**.
- $\frac{\partial \mathbf{Q}(\hat{\theta}^{(j)})}{\partial \theta}$ is the **gradient vector**, which provides the direction of **steepest ascent**; its negation ensures movement toward the minimum.
Characteristics of Steepest Descent
- **Slow to converge:** The algorithm moves in the direction of the gradient but does not take into account curvature, which may result in slow convergence, especially in ill-conditioned problems.
- **Moves rapidly initially:** The method can exhibit **fast initial progress**, but as it approaches the minimum, step sizes become small, leading to slow convergence.
- **Useful for initialization:** Due to its simplicity and ease of implementation, it is often used to obtain **starting values** for more advanced methods like **Newton's method** or [Gauss-Newton Algorithm].
------------------------------------------------------------------------
Comparison with Gauss-Newton
| Method | Update Rule | Advantages | Disadvantages |
|---------------|----------------------------|---------------|---------------|
| **Steepest Descent** | $\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} - \alpha_j \mathbf{I} \nabla Q(\theta)$ | Simple, requires only first derivatives | Slow convergence, sensitive to $\alpha_j$ |
| **Gauss-Newton** | $\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} - \tfrac{1}{2}[\mathbf{F}'\mathbf{F}]^{-1} \nabla Q(\theta)$ | Uses curvature via $\mathbf{F}'\mathbf{F}$ (an approximate Hessian), faster near solution | Requires Jacobian computation, may diverge |
The **key difference** is that Steepest Descent **only** considers the gradient direction, while Gauss-Newton and Newton's method incorporate curvature information.
------------------------------------------------------------------------
Choosing the Learning Rate $\alpha_j$
A well-chosen **learning rate** is crucial for the success of gradient descent:
- **Too large**: The algorithm may overshoot the minimum and diverge.
- **Too small**: Convergence is **very slow**.
- **Adaptive strategies**:
- **Fixed step size**: $\alpha_j$ is constant.
- **Step size decay**: $\alpha_j$ decreases over iterations (e.g., $\alpha_j = \frac{1}{j}$).
- **Line search**: Choose $\alpha_j$ by minimizing $\mathbf{Q}(\theta^{(j+1)})$ along the gradient direction.
A common approach is **backtracking line search**, where $\alpha_j$ is reduced iteratively until a decrease in $\mathbf{Q}(\theta)$ is observed.
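A minimal sketch of backtracking line search, assuming a generic objective `Q` and its gradient are supplied by the caller; the shrinkage factor and stopping rule below are illustrative choices, not the only possibilities.
```{r}
# Sketch of backtracking line search: shrink alpha until the objective decreases
backtrack <- function(theta, grad, Q, alpha0 = 1, shrink = 0.5, max_halvings = 30) {
  alpha <- alpha0
  Q_old <- Q(theta)
  for (k in seq_len(max_halvings)) {
    if (Q(theta - alpha * grad) < Q_old) return(alpha)  # accept first decreasing step
    alpha <- shrink * alpha                             # otherwise shrink the step
  }
  alpha                                                 # fall back to a tiny step
}
# Toy example: quadratic objective Q(theta) = sum(theta^2) with gradient 2 * theta
Q_demo <- function(theta) sum(theta^2)
theta0 <- c(3, -2)
backtrack(theta0, grad = 2 * theta0, Q = Q_demo)
```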
------------------------------------------------------------------------
```{r}
# Load necessary libraries
library(ggplot2)
# Define the nonlinear function (exponential model)
nonlinear_model <- function(theta, x) {
theta[1] * exp(theta[2] * x)
}
# Define the Sum of Squared Errors function
sse <- function(theta, x, y) {
sum((y - nonlinear_model(theta, x)) ^ 2)
}
# Define Gradient of SSE w.r.t theta (computed numerically)
gradient_sse <- function(theta, x, y) {
n <- length(y)
residuals <- y - nonlinear_model(theta, x)
# Partial derivative w.r.t theta_1
grad_1 <- -2 * sum(residuals * exp(theta[2] * x))
# Partial derivative w.r.t theta_2
grad_2 <- -2 * sum(residuals * theta[1] * x * exp(theta[2] * x))
return(c(grad_1, grad_2))
}
# Generate synthetic data
set.seed(123)
x <- seq(0, 10, length.out = 100)
true_theta <- c(2, 0.3)
y <- nonlinear_model(true_theta, x) + rnorm(length(x), sd = 0.5)
# Safe Gradient Descent Implementation
gradient_descent <-
function(theta_init,
x,
y,
alpha = 0.01,
tol = 1e-6,
max_iter = 500) {
theta <- theta_init
sse_values <- numeric(max_iter)
for (j in 1:max_iter) {
grad <- gradient_sse(theta, x, y)
# Check for NaN or Inf values in the gradient (prevents divergence)
if (any(is.na(grad)) || any(is.infinite(grad))) {
cat("Numerical instability detected at iteration",
j,
"\n")
break
}
# Update step
theta_new <- theta - alpha * grad
sse_values[j] <- sse(theta_new, x, y)
# Check for convergence
if (!is.finite(sse_values[j])) {
cat("Divergence detected at iteration", j, "\n")
break
}
if (sum(abs(theta_new - theta)) < tol) {
cat("Converged in", j, "iterations.\n")
return(list(theta = theta_new, sse_values = sse_values[1:j]))
}
theta <- theta_new
}
return(list(theta = theta, sse_values = sse_values))
}
# Run Gradient Descent with a Safe Implementation
theta_init <- c(1, 0.1) # Initial guess
alpha <- 0.001 # Learning rate
result <- gradient_descent(theta_init, x, y, alpha)
# Extract results
estimated_theta <- result$theta
sse_values <- result$sse_values
# Display estimated parameters
cat("Estimated parameters (A, B) using Gradient Descent:\n")
print(estimated_theta)
# Plot convergence of SSE over iterations
# Ensure sse_values has valid data
sse_df <- data.frame(
Iteration = seq_along(sse_values),
SSE = sse_values
)
# Generate improved plot using ggplot()
ggplot(sse_df, aes(x = Iteration, y = SSE)) +
geom_line(color = "blue", linewidth = 1) +
labs(
title = "Gradient Descent Convergence",
x = "Iteration",
y = "SSE"
) +
theme_minimal()
```
1. [Steepest Descent (Gradient Descent)](#steepest-descent) moves in the direction of steepest descent, which can lead to zigzagging behavior.
2. Slow convergence occurs when the curvature of the function varies significantly across dimensions.
3. Learning rate tuning is critical:
- If too large, the algorithm diverges.
- If too small, progress is very slow.
4. Useful for initialization: It is often used to get close to the optimal solution before switching to more advanced methods like [Gauss-Newton Algorithm] or Newton's method.
Several advanced techniques improve the performance of steepest descent:
- **Momentum Gradient Descent**: Adds a momentum term to smooth updates, reducing oscillations (see the sketch below).
- **Adaptive Learning Rates**:
- AdaGrad: Adjusts $\alpha_j$ per parameter based on historical gradients.
- RMSprop: Uses a moving average of past gradients to scale updates.
- Adam (Adaptive Moment Estimation): Combines momentum and adaptive learning rates.
In practice, Adam is widely used in machine learning and deep learning, while Newton-based methods (including Gauss-Newton) are preferred for nonlinear regression.
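As an illustration of the momentum idea listed above, here is a minimal sketch of momentum gradient descent on a simple quadratic objective; the learning rate, momentum coefficient, and objective function are illustrative assumptions.
```{r}
# Sketch: gradient descent with momentum on a toy quadratic objective
# (learning rate, momentum coefficient, and objective chosen for illustration)
grad_Q <- function(theta) 2 * theta          # gradient of Q(theta) = sum(theta^2)
theta    <- c(3, -2)                         # starting point
velocity <- c(0, 0)                          # accumulated update ("momentum")
alpha    <- 0.1                              # learning rate
beta     <- 0.9                              # momentum coefficient
for (j in 1:100) {
  velocity <- beta * velocity - alpha * grad_Q(theta)  # smooth the update direction
  theta    <- theta + velocity
}
theta                                        # close to the minimizer c(0, 0)
```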
#### Levenberg-Marquardt Algorithm {#levenberg-marquardt}
The Levenberg-Marquardt Algorithm is a widely used optimization method for solving nonlinear least squares problems. It is an adaptive technique that blends the [Gauss-Newton Algorithm] and [Steepest Descent (Gradient Descent)](#steepest-descent), dynamically switching between them based on problem conditions.
The update rule is:
$$
\hat{\theta}^{(j+1)} = \hat{\theta}^{(j)} - \alpha_j [\mathbf{F}(\hat{\theta}^{(j)})' \mathbf{F}(\hat{\theta}^{(j)}) + \tau \mathbf{I}_{p \times p}]^{-1}\frac{\partial \mathbf{Q}(\hat{\theta}^{(j)})}{\partial \theta}
$$
where:
- $\tau$ is the **damping parameter**, controlling whether the step behaves like [Gauss-Newton Algorithm] or [Steepest Descent (Gradient Descent)](#steepest-descent).
- $\mathbf{I}_{p \times p}$ is the **identity matrix**, ensuring numerical stability.
- $\mathbf{F}(\hat{\theta}^{(j)})$ is the **Jacobian matrix** of partial derivatives.
- $\frac{\partial \mathbf{Q}(\hat{\theta}^{(j)})}{\partial \theta}$ is the **gradient vector**.
- $\alpha_j$ is the **learning rate**, determining step size.
The Levenberg-Marquardt algorithm is particularly useful when the Jacobian matrix $\mathbf{F}(\hat{\theta}^{(j)})$ is nearly singular, meaning that [Gauss-Newton Algorithm] alone may fail.
- When $\tau$ is large, the method behaves like Steepest Descent, ensuring stability.
- When $\tau$ is small, it behaves like Gauss-Newton, accelerating convergence.
- Adaptive control of $\tau$:
- If $SSE(\hat{\theta}^{(j+1)}) < SSE(\hat{\theta}^{(j)})$, reduce $\tau$: $$
\tau \gets \tau / 10
$$
- Otherwise, increase $\tau$ to stabilize: $$
\tau \gets 10\tau
$$
This adjustment ensures that the algorithm moves efficiently while avoiding instability.
```{r}
# Load required libraries
library(minpack.lm)
library(ggplot2)
# Define a nonlinear function (exponential model)
nonlinear_model <- function(theta, x) {
theta[1] * exp(theta[2] * x)
}
# Define SSE function
sse <- function(theta, x, y) {
sum((y - nonlinear_model(theta, x)) ^ 2)
}
# Generate synthetic data
set.seed(123)
x <- seq(0, 10, length.out = 100)
true_theta <- c(2, 0.3)
y <- nonlinear_model(true_theta, x) + rnorm(length(x), sd = 0.5)
# Robust Levenberg-Marquardt Optimization Implementation
levenberg_marquardt <-
function(theta_init,
x,
y,
tol = 1e-6,