```
using Plots
using GLM
```

# Linear Regressions with Julia

The first time I heard about Julia was a few months ago, in an introductory class to programming. Since then, I have come to know a bit more about the reputation of the language, supposely very efficient and especially suited for scientific computations.

In this article, I will try to perform some basic linear regressions with Julia and to plot them.

## Simple linear regression

Here, let us try to explain one continuous variable by another continuous variable.

First, those are the packages we are going to use :

The Plots package seems to be the most general package to plot data with Julia. It is a simple interface to several underlying plotting packages, like

**gr**or**plotly**. In this sense, it allows to easily switch between different frameworks that have different features, while keeping the same syntax.The GLM package seems to be the main package for Generalized Linear Models (GLM) with Julia. It will be useful in the computation of our simple and multinomial linear models.

Now, let us generate some data :

```
= range(0, 1, length=100)
x # Generate data :
= rand(100) .* x
y
# Check the data :
size(y)
```

`(100,)`

- The function
`range()`

in Julia allows to create an array with a starting and an ending value. Although having similar properties than vectors, arrays created with the`range()`

function have the “Range” type (or`StepRangeLen`

to be exact). Here, I just initialise the`x`

array to compute an easy to use`y`

vector. - The function
`rand()`

in Julia allows to randomly generate numbers. The default settings of the function make it so that the drawn data is of type`Float64`

between \(0\) and \(1\). Its argument, set to the value of \(100\), indicates the number of random values that are to be drawn. Therefore,`rand(100)`

returns a vector of 100`Float64`

numbers. - The
`.*`

function is a vectorized multiplication. In Julia, adding a dot`.`

before a mathematical operator allows to apply this operation to all the elements of a vector. If we wanted to add \(2\) to all the elements of the`y`

vector, we should have written`y .+= 2`

. - The last function
`size()`

is useful to get more dimension of the array we consider. In our case, it yields`(100,)`

because it is a \(100\) rows vector. We would get the same if we did`size(y)`

.

Now, let us plot the data we just created :

```
plot(
scatter(x,y, label = "data"),
= "Generating Random Data",
title = "Random variable",
ylabel = "x",
xlabel )
```

The function `plot()`

in Julia comes from the Plots package. It is the default plotting function, with the **gr** backend by default (we will get to that later on). It contains several elements :

- The function
`scatter()`

allows to plot a scatterplot, i.e. independent points that are not linked by a line. If we had not indicated the`scatter()`

function, it would have plotted a line between all the points. - The
`label`

value is the one that gets displayed within the box. Specifying “data” allows to avoid the “y1” default value. `title`

,`ylabel`

and`xlabel`

are pretty straightforward to understand.

It is now time to run a simple linear regression. Two simple ways to run such a model are :

- Using the GLM package,
- Using the built-in matrix computation function of Julia.

### GLM package

First, with the GLM package :

`= lm(@formula(y ~ x), (;y, x)) model `

```
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int32}}}}, Matrix{Float64}}
y ~ 1 + x
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.00995436 0.03272 0.30 0.7616 -0.0549775 0.0748862
x 0.468727 0.0565302 8.29 <1e-12 0.356545 0.58091
──────────────────────────────────────────────────────────────────────────
```

Now, if we want to plot our model with the data, we can run :

```
= GLM.coef(model)
coefs plot!((x) -> coefs[1] + coefs[2] * x, label="model",
= "Simple linear regression") title
```

If we want to include the equation of our linear model in the legend of the plot, we can use the LaTeXStrings.jl package :

```
using LaTeXStrings
plot(title = "Simple linear regression",
scatter(x,y, label = false),
-> coefs[1] + coefs[2] * x, label=latexstring("y=$(coefs[1]) + $(coefs[2]) * x")) (x)
```

And if we want it to be outside the plot :

`plot!(legend=:outerbottom, legendcolumns=1)`

### Matrix computation

Second, with the built-in matrix computation function :

```
= [ones(length(x)) x]
X \y) (X
```

```
2-element Vector{Float64}:
0.00995436086366171
0.4687272109439466
```

The first line creates a matrix \(X\) such that \(X\in\mathbb{R}^{100\times 100}\), with its first columns being only ones, and its second column being the vector of \(x\).

Without entering into the details, it does exactly what the OLS method does, but in a more rudimentary way, using the normal equation to find the OLS estimates.

### Performance comparison

We see that both approaches yield the same result, but what are the differences between them ? Apart from the syntax, we could check the performance of each of them. In order for us to do that, we are going to use the `@time`

function, which returns the time and space used by a chosen function.

```
@time begin
= ones(length(x))
vec_1 = transpose([transpose(vec_1);transpose(x)])
X \y)
(Xend
@time begin
= lm(@formula(y ~ x), (;x, y))
model coef(model)[1:2]
GLM.end
```

Built-in matrix approach :

` 0.000070 seconds (30 allocations: 40.266 KiB)`

```
2-element Vector{Float64}:
0.00995436086366171
0.4687272109439466
```

GLM package approach :

` 0.000120 seconds (146 allocations: 20.438 KiB)`

```
2-element Vector{Float64}:
0.009954360863661988
0.46872721094394604
```

If we compare the information returned by the `@time`

function, we see that the matrix computation approach seems to be slightly faster. However, we have to take into account that the `lm()`

function does not only compute the coefficients, but alswo other informations, that are here not displayed, such as the confidence interval, and the probability associated to the t-statistic. In this sense, this comparison does not make justice to the GLM package. We can however remember that if we only wanted the coefficients of such a model, we could prefer the matrix multiplication approach if we need time.

## Multiple linear regression

Here, let us try to explain one continuous variable by multiple continuous variables.

Again, we first have to generate and plot the data :

```
# Generating random data :
= (rand(100) for _ in 1:4)
x,y,w,z
# Plotting the data :
plot(scatter(x, y, z,
= "Static 3D plot",
title = w, label="data")) marker_z
```

### Rotatable plot

Now, we could continue with the plot function of the default **gr** backend package of `Plots`

, but to get rotable 3D plots, we can use the PlotlyJS package. Since Plot.jl integrates plotly, we could try just adding `plotly()`

:

```
# We switch from the gr() to the plotlyjs() backend within Plots :
plotlyjs()
# We plot :
plot(scatter(x, y, z;
= w,
marker_z = 2,
markersize = "Rotatable 3D plot",
title = "data",
label = "x",
xlabel = "y",
ylabel = "z")) zlabel
```

It is now time for some model. Let us consider that the variable `z`

can be explained linearly by the variables `x`

and `y`

, and that `w`

does not provide supplementary information.

We are going to use the two methods we saw previously :

```
# Using matrix computations :
@time begin
= [ones(length(x)) x y]
X \z
Xend
# Using GLM :
@time begin
= lm(@formula(z ~ x + y), (;x, y, z))
model = GLM.coef(model)
coefs
coefsend
```

Built-in matrix approach :

` 0.000045 seconds (33 allocations: 42.531 KiB)`

```
3-element Vector{Float64}:
0.36233333368806564
-0.0017472521580554313
0.21145297793195228
```

GLM package approach :

` 0.000265 seconds (205 allocations: 29.141 KiB)`

```
3-element Vector{Float64}:
0.362333333688066
-0.0017472521580556221
0.2114529779319519
```

As before, both results are approximatively the same, with a small advantage for the matrix computation approach in performance terms.

### Plotting a multinomial regression

Now let’s try to plot the plane of this regression :

```
linear_model(x,y) = coefs[1] .+ coefs[2] * x .+ coefs[3] * y
plot(x,y,linear_model,
=:surface,
st= "Plane of a two dimensional linear regression",
title ="model",
label="x",
xlabel = "y",
ylabel = "z") zlabel
```

While I did not manage to plot simulatneously the points and the surface on the plot, it seemed a satisfactory result for the moment.