## Our objective

If you studied math at school until at least 16 years old, chances are you studied the fundamentals of machine learning, what we commonly describe as artificial intelligence (AI) today. We will teach you this via linear regression.

But first a question:

**Do you remember calculating a line of best fit during algebra class?**

If you answered “yes” (even if you don’t remember *how* you did it) then this tutorial is for you. If you answered “no”, check out this higher-level explanation of machine learning.

Assuming you answered “yes”, great news: you can learn the fundamentals of most machine learning applications. These are broadly the same for most machine learning and deep learning systems.

Of course, there’s a lot more to machine learning than what we can cover in a single post, and such systems are a lot more complex in real life, but an understanding of the basics will go a long way, and provide confidence to explore these topics and more complex variations in more detail.

We hope this provides greater intuition regarding what is actually happening under the hood, without needing to know how to build your own engine. In particular, after reading this article you will understand:

- *What* a classical machine learning system is designed to do and *how* it does so
- *How* inputs are analysed by a machine learning system in order to generate an output
- *How* a machine learning system scores its abilities in order to improve itself
- *Which parts* of the machine learning system are fixed by the human designers, and those generated by the machine

We will do this via the example of linear regression, a commonly used machine learning technique using minimal maths. In some places we may need to simplify things in order to expedite intuition.

**Disclaimer:** if you’re a developer or machine learning expert, this post probably isn’t for you as we necessarily keep things as simple as possible and in a few cases omit some details to keep things accessible. In a number of places, we’ve abstracted some of the process and math, in particular omitting batching of different steps and summations. If this upsets you, then it’s not the article you want!

## How does machine learning fit into AI?

Machine learning is the study of computer algorithms that improve automatically through experience. It is a subset of AI.

## What is Machine Learning?

The classical definition is Tom M. Mitchell’s. Mitchell, a computer scientist, provided a widely quoted definition of the algorithms studied in machine learning:

“A computer program is said to learn from experience *E* with respect to some class of tasks *T* and performance measure *P* if its performance at tasks in *T*, as measured by *P*, improves with experience *E*.”

Past experience, E, is data. Machine learning is **garbage in, garbage out** (or **GIGO**). If the data is wrong, incomplete, inconsistent, unrepresentative, or too scarce, the system will perform poorly at its task, struggling to learn the correct behaviour(s) that improve its performance at that task over time (and vice versa).

## How is machine learning different to previous forms of AI?

Recent advances in AI have been achieved by applying machine learning to very large data sets. These algorithms detect patterns, learning how to make predictions and recommendations by processing data and experiences, rather than by receiving explicit programming instruction (by humans). The algorithms also adapt in response to new data and experiences to improve efficacy over time.

Previous attempts at AI relied on hand-coded logic, e.g. IF this THEN that. They required human adaptation of the algorithms via addition, deletion or amendment of that logic in response to new circumstances or understandings about a particular AI problem.

## What are major types of machine learning?

There are **three** major types:

#### Supervised Learning

The computer is presented with example inputs (x) labelled with their desired outputs (y), given by a “teacher” (an engineer or subject matter expert). The goal is to learn a general rule mapping inputs (x) to outputs (y). This is the subject matter of the later worked example we will use to teach you machine learning, specifically the technique linear regression.

#### Unsupervised Learning

Unlike supervised learning, no labels are given to the algorithm. Instead, it must discern its own way in which to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

#### Reinforcement Learning

A computer program interacts with a dynamic environment in which it must perform a certain goal (such as playing a game against an opponent) using a set of policies. As it navigates its problem space, the program is provided feedback that’s analogous to rewards, which it tries to maximize by adjusting the policies in order to perform better at each subsequent attempt. It is a bit like teaching a child (or pet) good behaviours from bad, giving them a set of rules and seeing how well they understand and adapt them to novel situations, rewarding them with treats if they do well, or punishing them if they do poorly!

#### Supervised vs Unsupervised Learning – which is better?

For further discussion of supervised vs. unsupervised learning, and which is better, see here.

#### What can you do with these techniques?

Machine learning systems use data to learn the mathematical relationship that produces an output y given an input x, where y can be:

- a pattern, e.g. exploring customer demographic data to identify patterns (which customers group together, and therefore might respond best to recommendations)
- classification, e.g. cat vs not cat
- prediction, e.g. student A will score 92% on their next exam
- goal, e.g. beat someone at chess.

## If it’s all about data, isn’t this analytics?

Sort of. Machine learning shares a lot of DNA with analytics but is distinct. That said, they are often used in conjunction with one another, sometimes with the same data.

Machine learning is a combination of mathematical techniques applied to data, stitched together in specific workflows (algorithms). Many of these borrow heavily from statistics, probability and linear algebra, all of which are often used in analytics. So what’s the difference between machine learning and analytics?

Analytics is typically defined as **descriptive**. It’s used to interrogate data in order to describe **what happened** to help humans understand causation, review the performance of a product or process, draw insights and make decisions.

Machine learning is typically **predictive** or **prescriptive**. The former is about using data to anticipate **what will happen** based on other data (**regression**), or what something is (**classification**, e.g. is photo X “cat” or “not cat”?). The latter is about providing **recommendations** on **what to do** to achieve one or more goals, e.g. Amazon’s “Customers who bought X also bought Y” feature learns recommendations based on your buying habits correlated with those of other buyers with similar habits to your own.

## Linear Regression: the machine learning you didn’t know you knew

By now you hopefully have a roadmap for the world of machine learning, its major types and its relationships with AI and analytics.

We’re going to explore linear regression in order to unpick the fundamental building blocks of most machine learning systems. Linear regression is a **predictive** and **supervised** machine learning technique.

Linear regression is the “hello world” of machine learning.

By this we mean it’s the first technique developers and machine learning engineers learn when deciding to study machine learning, often used to illustrate the basic concepts.

It’s used because it encompasses the principal components of most machine learning systems and is also something familiar from school level math.

As it is a supervised learning technique, it’s slightly more intuitive than unsupervised or reinforcement learning. The essential idea is that we give the system enough examples of some relationship between x and y and eventually the system learns the magic connection that links x and y, enabling it to perform a related task with that type of data. It’s essentially: learn by example. Lots of examples!

And the best part? You studied this at school!

## What is linear regression?

Linear regression is an algorithm that allows you to predict a given output **y** for a given input **x**. That’s it.

Going slightly deeper, it is highly interpretable (i.e. it is easy for us to understand and model the relationship between an input and an output). It is also the standard method for modelling the past relationship between independent input variables (**x**) and dependent output variables (**y**, which can have an infinite number of values) to help predict future values of the output variables.

## What can you do with linear regression?

Predict stuff! A concrete example:

Given the number of contracts to be drafted for a new legal matter (x), how many billables (y) will this cost the client based on historic examples of matters with the same or similar number of documents?

Why is this helpful? Well it would help scope legal matters and their cost, creating better fee estimates and a lower likelihood of exceeding a fee cap and upsetting the client.

## How does linear regression work?

Supervised machine learning algorithms such as linear regression have 8 components common to most machine learning systems:

#### Ingredients for a basic machine learning recipe

- An objective
- A dataset
- A model function
- A cost function
- An optimization function
- An iterative feedback loop
- Parameters
- Hyperparameters

We will explore each, layering them together. By the end, you’ll understand each and how they fit together.

#### 1. The Objective

To keep things simple our objective is:

Given a number x we want to predict a related value y

This is what our system needs to learn, and what we will train the system to do.

In real life, x could be the number of documents anticipated for a new legal matter and y the predicted cost in billable hours. More complex systems would include other factors, e.g. number and type of lawyers needed, jurisdictions, some score regarding the complexity of the intended matter etc.

The objective, the understanding of the x and y relationship, is known as the **model** (more on that later).

#### 2. The Dataset

A dataset is simply the data you want your machine learning system to process in order to learn the desired behaviour.

In supervised machine learning, this will be:

- a collection of examples
- where each example is a pair of x (input) and y (output)
- where x is a **feature** about a thing (a variable independent of y), e.g. “number of documents”
- where y is a **label** about a thing (a variable dependent on x), e.g. “$100,000” amount of billable hours

For this reason, it’s best to think about machine learning, especially supervised learning, as **thing labelling**.

Our dataset looks like this:

Each row in the above dataset is an **example** relationship between known x and y values. Each x value is a **feature** and each y value a **label** (and the thing we want the system to predict).
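As a minimal sketch, a dataset like this could be held in Python as a list of (x, y) pairs. Only the first two rows (x = 1, y = 2 and x = 2, y = 3) come from the worked example in this article; the remaining rows are illustrative assumptions that follow the same pattern.

```python
# Each example pairs a feature x (input) with a label y (desired output).
# Rows after the first two are assumed purely for illustration.
dataset = [
    (1, 2),
    (2, 3),
    (3, 4),
    (4, 5),
    (5, 6),
    (6, 7),
]

features = [x for x, _ in dataset]  # the x values (inputs)
labels = [y for _, y in dataset]    # the y values (what we want to predict)

print(features)
print(labels)
```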

Someone, a human, will need to find this data (**data collection**), decide the relevant features (**feature engineering**), apply the relevant labelling (**data labelling**), deal with any missing, incomplete or inconsistent data (**data cleaning**) and organise it into the proper structure (one or more tables of data) for consumption by the relevant algorithms (**data wrangling**). This needs to happen **before** the dataset is fed into the machine learning algorithms. As you can see, this is a lot of prep work!

We split the dataset into:

- A training dataset **(green)**
- A validation dataset **(orange)**
- A test dataset **(red)**

As we will explain, we **train** the system on the **training dataset**, then **validate** its performance on the **validation dataset** and finally **test** the system on the **test dataset**.
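One simple way to picture the split is sketched below. The dataset and the 60/20/20 proportions are assumptions chosen for illustration, not values from this article, and the rows are kept in order to make the sketch easy to follow (real projects usually shuffle the data first).

```python
# Hypothetical (x, y) examples following the pattern y = x + 1.
dataset = [(x, x + 1) for x in range(1, 11)]

# Illustrative proportions: 60% training, 20% validation, 20% test.
n = len(dataset)
train_end = n * 6 // 10
validation_end = n * 8 // 10

training_set = dataset[:train_end]                  # used to learn m and b
validation_set = dataset[train_end:validation_end]  # the "mock exam"
test_set = dataset[validation_end:]                 # the "final exam"

print(len(training_set), len(validation_set), len(test_set))  # 6 2 2
```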

Think of it like how you prep to pass exams.

When planning exam prep you gather a **dataset**. That dataset will comprise your **revision**, facts from which you study and learn the relationships about the subject matter.

You then use mock exams or past papers plus model answers to **validate** how well you understand and have revised the subject matter. If you do well on the mocks, great. If not, you review the model answers and go over your revision materials again to try and understand what you don’t.

Finally, you **test** your knowledge at the final exam. This proves your ability to **generalize** your understanding of a subject matter to new, sight unseen questions about it.

The hope is that provided you revised (**trained**), checked your knowledge and made any changes to your understanding via mock exams (**validation**) you should pass the exam (**test**). If you still fail the final exam, you probably need to go back to the drawing board and reconsider the structure and content of your revision.

We can plot our dataset like this, with x values along the x axis (horizontal) and y values along the y axis (vertical):

Our system’s goal is to learn *what* mathematical manipulation applied to x results in the correct value y. This is known as the **model.** As we shall see, this can also be represented graphically.

#### 3. The Model Function

Recall that the objective of our system is:

Given a number x we want to predict a related value y

Machine learning engineers express this relationship with a mathematical equation, known as the **model function**. Mathematically this is:

y = mx + b

Let’s break this down.

Of the entire universe of possible x & y relationships and their constituent values, we already know a small subset which we’ve labelled and organised, i.e. our dataset. This means we can plug into the model function at least *some* x and y values, i.e. the ones we already know and have to hand.

But what about m and b?

At this point m and b are unknown to us: m and b are the magic numbers that produce the correct y value when applied to a given x value.

m is known as a **weight** or **parameter**, and b is known as the **bias unit**. The goal of our system is to identify the correct values of m and b such that our system always (or as close to always as we can) correctly predicts y for a new x value. For these purposes, think of the bias unit as the point on the graph where our line of best fit needs to cut across the y axis.

In this way, we hope to pick any x value in the universe and predict the y value, whether or not we know that y value.

Why is that useful? Well, it helps us make predictions about the world around us or an outcome, e.g. number of billable hours likely, given an input, e.g. number of contracts to be negotiated.

So you can begin to build an intuition of what our system is learning toward, let’s plot the target model function our system needs to learn, or y = x + 1:

This new **green** line – the **line of best fit** – is something you calculated at school, or at least used your ruler to draw. Either way, you may recall the idea was to find the line through, or as close to, as many of the coordinates on your graph as possible. Note the line cuts across the y axis at y = 1, the bias unit value for the line of best fit (y = x + 1).

Once you had the line of best fit at school, you used your ruler to draw a vertical line up from a new x value and horizontally across from the line of best fit to find the corresponding y value (e.g. the **pink** lines).

Later you learned how to do this mathematically using linear algebra, including how to work out the line of best fit from the provided data.

This line of best fit represents the **model function**. It is what we need the system to learn.

In real life, neither the system designers nor the AI will know this at the start of the process. I’ve stated the answer here so it’s easy to follow *what* the system needs to work towards in order to generate the correct y values for given x values.

Try it for yourself.

For example, for row 1 of the dataset where x = 1, we can plug this x value into our model function y = x + 1, which becomes y = 1 + 1 and returns y = 2. When we compare this against the correct y value for example 1, we see that it is also y = 2. Our model correctly predicts a y value for a given x value.
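The model function can be sketched in a few lines of Python. The function name `predict` is our own choice for illustration; the values m = 1 and b = 1 are the target values stated above.

```python
def predict(x, m, b):
    """Model function: return the predicted y for an input x, i.e. y = mx + b."""
    return m * x + b

# With the target values m = 1 and b = 1, the model is y = x + 1.
print(predict(1, m=1, b=1))  # 2, matching the first example in the dataset
```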

So how does the system “learn” its way toward the correct values of m and b?

**By guessing values of m and b over and over until its guesses stop improving.**

Initially the guesses are random, but they are then tuned mathematically via a feedback loop.

So you’re saying a lot of machine learning is analysing data to make probabilistic guesses at things? Yup, pretty much.

Of course, there’s more complexity and nuance to this, but that’s the gist. It’s important to understand that machine learning is maths not minds. Even a neural network – inspired by a theoretical model of how brain neurons process an input and produce an output – is just maths. There is no semantic or symbolic understanding comparable to how we believe humans conceive of the world and reason about it.

Does that matter? It depends. But that is a philosophical question beyond the scope of this article.

Ok, philosophy aside: how does the system make these guesses, know whether they are accurate and, depending on that analysis, improve itself? Let’s find out!

#### 4. The Cost Function

The **cost function** calculates the gap between the system’s guess at y and the actual value of y for a given example. It’s also known as the **error function**.

Let’s return to the first example in our dataset. In that example x = 1 and y = 2. If the system predicted y = 10 for x the system can calculate the error by doing 10 – 2 = 8. In other words, the system was 8 off.

Why is this useful? The system has a measure of how well it is performing. Returning to the exam analogy. It’s a bit like having the questions (x values) and the answers (y values) but first covering up the answers with your hand and guessing at how to answer each question then removing your hand to see how you did. You now have a measure of how well you performed that you can learn from.

In reality the cost function math is more complex, and will also typically be the sum of the error for all guesses for all examples, but for simplicity it’s easier when starting to learn machine learning to think of the process on an example by example basis.
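As a rough sketch, the simplified per-example cost described here is just a subtraction. For the whole-dataset version, one common choice for linear regression is the mean squared error (the average of the squared gaps); we show it as an assumption about what "more complex" might look like, not as the only option. Both function names are hypothetical.

```python
def example_cost(predicted_y, actual_y):
    """Simplified per-example cost: the gap between the guess and the truth."""
    return predicted_y - actual_y

def mean_squared_error(predictions, actuals):
    """A common whole-dataset cost: the average of the squared gaps."""
    gaps = [p - a for p, a in zip(predictions, actuals)]
    return sum(g ** 2 for g in gaps) / len(gaps)

print(example_cost(10, 2))                  # 8, as in the example above
print(mean_squared_error([10, 3], [2, 3]))  # (8**2 + 0**2) / 2 = 32.0
```

Squaring the gaps means guesses that are too high and too low both count against the system, and large misses count for more than small ones.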

We’ll see how this comes together in more detail below.

#### 5. The Optimisation Function

The optimization function takes the learning from the cost function – i.e. how wrong (or right) the system is at guessing – and uses that new information to inform how best to update itself in order to maximise its chances at improving its guess at the next turn.

When we refer to “turns” this is because machine learning is iterative, which leads nicely into…

#### 6. An iterative feedback loop

You may have noticed we’ve built up a number of components and hinted at the overall process being turn-based. Well, you’d be correct. Machine learning is iterative. It’s about mathematically expedited trial and error.

So how do the dataset, model function, cost function and optimization function work together iteratively in order to learn the relationship between x and y? Let’s find out!

**First Pass, First Configuration**

At the start, neither the system engineers nor the system have any idea what m and b might be. Instead, the engineers initialize m and b to random values, e.g. m = 0.5 and b = 3.

These are plugged into our model function, e.g. y = mx + b becomes y = 0.5x + 3.

Next, the x value (but not the y value) for the first example in our training dataset (below) is added into the model function.

So the first configuration of the model function becomes: y = (0.5 × 1) + 3.

**First Pass, First Guess**

The system computes the Model Function for its first guess at y. With this configuration:

y = (0.5 × 1) + 3

y = 0.5 + 3

y = 3.5

So the **Predicted Y Value** for the system’s first guess is y = 3.5.

**First Pass, First Check**

So how does the system know if y = 3.5 is correct for x in this example, and whether or not the current model function configuration is correct or needs tuning?

Simple, the system calculates the difference between:

The Predicted Y Value (y = 3.5) AND The Actual Y Value (y = 2) EQUALS 1.5 (i.e. 3.5 - 2)

That calculation is made by the **cost function** and the resulting value, i.e. 1.5 in this example, is the **cost value** for the first guess. As we said, the actual math function is more complex and is typically calculated on the error for all guesses for all examples to speed things up, but this is the core idea:

**Comparing the prediction with the actual result and calculating the difference.**
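The first pass so far can be reproduced with a short sketch. The numbers are the ones used in this walkthrough; the variable and function names are our own.

```python
def predict(x, m, b):
    """Model function y = mx + b."""
    return m * x + b

m, b = 0.5, 3       # random starting values from the first configuration
x, actual_y = 1, 2  # first example in the training dataset

predicted_y = predict(x, m, b)  # first guess
cost = predicted_y - actual_y   # first check (simplified cost)

print(predicted_y)  # 3.5
print(cost)         # 1.5
```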

We can plot this version of the model function below (**orange** line), and highlight the cost (i.e. error) of its predictions against the actual data for our dataset.

The costs are the gaps (**red** arrows) between the **blue** points on the graph (our dataset of x + y pairs) and the current model function’s (y = 0.5x + 3) predictions for y, illustrated by the **orange** line.

As you can see, for the first example in our dataset where x = 1 the current model function’s prediction of y is off by +1.5 because it predicts y = 3.5 when it is in fact y = 2.

Recall that the goal of the model function is to find the **line of best fit** that passes through, or as close to, as many points as possible. So let’s find out how the model improves itself toward that goal!

**First Pass, First Update**

So far, the system has made its first guess, compared its guess against the actual answer to calculate the difference between its guess and the truth which turns out to be wrong by +1.5 (the **cost**).

Its goal at this step is to **identify what adjustment to its model will minimize the cost for its next guess**, i.e. so the next guess is closer to the actual y value given an input x.

The optimization function takes the current value of m, currently m = 0.5, and adjusts it by:

a. **PLUS** some small amount, e.g. +0.1; and

b. **MINUS** some small amount, e.g. -0.1,

and in each case, calculates whether each slightly adjusted m value will make the cost get worse (i.e. increases) or better (i.e. decreases).

This is essentially the mathematical equivalent of the **Hot or Cold** game where you are blindfolded and take small steps in different directions toward a hidden object and ask a friend to say “hotter” (i.e. closer to the object) or “colder” (i.e. further from the object).

In that game you decide to head in the direction that returned the “hottest” signal from your friend.

Without exploring the maths – which go a bit beyond school level algebra (unless you studied algebra beyond age 16 and know some basic calculus) – this is the intuition you need to understand.

**The optimisation function is like the Hot or Cold game.**

This same logic from Hot and Cold applies to the Optimization Function, e.g.

**(A) PLUS Adjustment.** Original m = 0.5. Adjusting m + 0.1 means new m = 0.6. Plugging this new m into the model function we get y = (0.6 × 1) + 3 or y = 3.6. Plugging that into the cost function, i.e. difference between y = 3.6 (prediction) and y = 2 (actual value), i.e. 1.6, means the cost has **increased** from 1.5 (when using the original m value) to 1.6 (using this micro adjusted m value). Because it increased the cost, we can say it got **worse**.

**(B) MINUS Adjustment.** Original m = 0.5. Adjusting m – 0.1 means new m = 0.4. Plugging this new m into the model function we get y = (0.4 × 1) + 3 or y = 3.4. Plugging that into the cost function, i.e. difference between y = 3.4 (prediction) and y = 2 (actual value), i.e. 1.4, means the cost has **decreased** from 1.5 (when using the original m) to 1.4 (using this micro adjusted m value). Because it decreased, we can say it got **better**.

Because adjustment (B) reduced the cost, the system learns that it should **reduce m by some amount in that direction** (i.e. downwards) as this seems to be the correct direction that leads to a lower cost and a better prediction.
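Here is a minimal sketch of the Hot or Cold check for m, using the numbers from this walkthrough. The helper names are hypothetical, and the cost is taken as the size of the gap. Real systems typically use calculus to find the better direction rather than literally trying both nudges, but the intuition is the same.

```python
def predict(x, m, b):
    """Model function y = mx + b."""
    return m * x + b

def cost(x, actual_y, m, b):
    """Simplified cost: the size of the gap between guess and truth."""
    return abs(predict(x, m, b) - actual_y)

x, actual_y = 1, 2  # first training example
m, b = 0.5, 3       # current configuration
nudge = 0.1         # the small micro-adjustment

cost_now = cost(x, actual_y, m, b)
cost_plus = cost(x, actual_y, m + nudge, b)   # adjustment (A)
cost_minus = cost(x, actual_y, m - nudge, b)  # adjustment (B)

# Head in whichever direction is "hotter", i.e. gives the lower cost.
direction = -1 if cost_minus < cost_plus else +1

print(round(cost_now, 2), round(cost_plus, 2), round(cost_minus, 2))  # 1.5 1.6 1.4
print(direction)  # -1, i.e. reduce m
```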

At this point, the system is basing this decision on **only having seen the first example and how this relates to the model function**.

With more examples, this understanding of micro-adjustments to m as they relate to the cost becomes more complex, nuanced and more accurate over time. Again, this is like in the Hot and Cold game where lots of little steps add up and with increasing speed take you to the hidden object.

To speed things up, the system designers use a **Learning Rate**, an arbitrary amount by which m is updated in the direction of the best micro adjustment from above.

Let’s assume the engineers set the Learning Rate as 0.25.

This means m is reset from the originally random value of m = 0.5 to m = 0.25 (i.e. 0.5 – 0.25 = 0.25). The system has **moved from a random configuration of m to a data-informed adjustment of those variables**.

The same process applied to m is applied to b at the same time (omitted in the above for simplicity), and for simplicity let’s assume b is reset to 1.5 after this step.
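The update step itself is tiny. The sketch below follows this article's simplification, where m moves by the full Learning Rate in the better direction; the variable names are our own, and the value of b is simply the assumed value stated above rather than something computed here.

```python
learning_rate = 0.25  # chosen by the engineers
direction = -1        # "reduce m", as found by the micro-adjustments

m = 0.5                            # original random value
m = m + direction * learning_rate  # data-informed update
print(m)  # 0.25

b = 1.5  # assumed result of applying the same process to b
```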

**Second Pass, Second Configuration**

The system then restarts using the adjusted values of m and b:

m = 0.25 and b = 1.5 (whereas before, for the first pass they were m = 0.5 and b = 3)

Together this updates the model function as follows:

From y = 0.5x + 3 (the configuration from the first pass) to y = 0.25x + 1.5

As before, we choose the next example in our dataset and input the x value into our model function, i.e. x = 2.

This means the model function configuration at the second pass is this:

y = (0.25 × 2) + 1.5

Crunching that equation, the second guess for the second example is therefore y = 2.

**Second Pass, Second Check**

Same as for the first pass: the system checks this new Predicted Y Value (y = 2) against the Actual Y Value (y = 3) using the Cost Function to generate the new **cost**. The gap between them is 3 – 2 = 1.

Much better!

Now the system is only 1.0 off the correct value (y = 3) for the second example, this time guessing too low, vs. being off by 1.5 (guessing too high) after the first guess on the first example.
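The second pass can be checked the same way as the first. This sketch reuses the updated values m = 0.25 and b = 1.5 and measures the cost as the size of the gap; the names are our own.

```python
def predict(x, m, b):
    """Model function y = mx + b."""
    return m * x + b

m, b = 0.25, 1.5    # adjusted values after the first pass
x, actual_y = 2, 3  # second example in the training dataset

predicted_y = predict(x, m, b)
gap = abs(predicted_y - actual_y)

print(predicted_y)  # 2.0
print(gap)          # 1.0
```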

We can plot this new model function (y = 0.25x + 1.5) as follows:

Once again, the costs are the gaps (**red** arrows) between the **blue** points on the graph (our dataset of x + y pairs) and the current model function’s (y = 0.25x + 1.5) predictions for y, illustrated by the **orange** line. Again, notice the line now cuts the y axis at 1.5, the bias unit in the model function y = 0.25x + 1.5.

The system is **learning** the relationship between x and y, and has **improved** this understanding between the first guess for the first example and the second guess for the second example. Notice also, the new model function has gotten even closer to the first example. Likewise, notice that the model function (y = 0.25x + 1.5) is closer to the target model function (y = x + 1).

The more examples the system processes, the closer the line will adjust to all other points.

So we can say the system has used the **data** (the first example in our training dataset) to **learn how to adjust the weights and bias unit** (aka parameters) in our **model function** to tune the model function’s performance toward accurate predictions of y given x.

Our system isn’t quite there yet, but it’s on its way to learning the correct function that maps a given x input to its correct y output, which as we know is y = x + 1.

The next step for the second pass would be to run the optimization function as before. For brevity, we won’t do that, but hopefully you get the idea that this process is repetitive.

**Rinse and repeat through the training dataset**

We repeat the above process again and again for each example. As the system works through each example it will improve, minimising its costs to maximise its predictions (i.e. improving accuracy).
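To show the whole loop running, here is a sketch that swaps the Hot or Cold nudges for gradient descent on the mean squared error, which is a common real-world way to do the same job using basic calculus. It also averages over all training examples at each step rather than going one example at a time. The dataset, learning rate and number of passes are assumptions picked so this small example settles down; they are not values from this article.

```python
def predict(x, m, b):
    """Model function y = mx + b."""
    return m * x + b

def train(examples, m, b, learning_rate, epochs):
    """Repeatedly adjust m and b to shrink the mean squared error."""
    n = len(examples)
    for _ in range(epochs):
        # Gradients: how the cost changes as m and b change.
        grad_m = sum(2 * (predict(x, m, b) - y) * x for x, y in examples) / n
        grad_b = sum(2 * (predict(x, m, b) - y) for x, y in examples) / n
        # Step "downhill": move each parameter against its gradient.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Hypothetical training examples following y = x + 1.
training_set = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

m, b = train(training_set, m=0.5, b=3, learning_rate=0.05, epochs=2000)
print(round(m, 3), round(b, 3))  # 1.0 1.0
```

Starting from the same random guess as the walkthrough (m = 0.5, b = 3), the loop ends up at the target model function y = x + 1.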

For a great visual of this in process on an entirely different and more varied dataset over 50 iterations (aka **epochs**) see the below:

**Validate**

Once our system has processed the training dataset per the above process we **freeze** the system so there are no further adjustments to the values of m and b. We then run the above process on the **validation dataset**.

However, two key points to understand:

- We run all steps in the above process again on the validation dataset **except the optimization function**. For validation, we only want to see how the system performs on new, sight unseen, data and do not use that information to update the system’s performance.
- The validation dataset has not – as yet – been seen by the system. This is because we want to see how well the system performs on new, sight unseen data, but before the final exam (i.e. the test data). This data was not part of the training dataset so the system cannot have factored this into its learning prior to this validation process.

Recall the exam prep analogy. This is the step whereby after revising on our revision data we attempt a mock exam paper to test our knowledge **before** the real exam. We do this to test our understanding before the final exam and provide an **intermediate** measure of how well we understand new data (i.e. new facts and exam questions) based on our existing revision. If we score well on the mock exam, indications are good that we might do well on the **final exam**. If we do poorly, we have room for improvement and a chance to go back and adjust our revision strategy, methodology and materials to hopefully improve our performance.

This is the intuition behind validation. It’s the mock exam paper step in a student’s exam prep.
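In code terms, validation is the same predict-and-check routine with the update step left out. The sketch below uses hypothetical validation examples and, purely for illustration, freezes the parameters at the part-trained values from the second pass (m = 0.25, b = 1.5), so the score is understandably poor.

```python
def predict(x, m, b):
    """Model function y = mx + b."""
    return m * x + b

def average_gap(examples, m, b):
    """Average size of the gap between predictions and actual y values."""
    return sum(abs(predict(x, m, b) - y) for x, y in examples) / len(examples)

# Frozen parameters: read here, never updated during validation.
m, b = 0.25, 1.5

# Hypothetical examples the system has not seen during training.
validation_set = [(6, 7), (7, 8)]

print(average_gap(validation_set, m, b))  # 4.375
```

A large average gap like this is the signal that the system needs more training, or that the engineers need to adjust something, before the final test.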

**Test**

This is where the rubber meets the road. In our exam analogy, this is the **final exam**. Like the final exam, the idea is to test how well the student’s (in our case, the system’s) learning **generalizes** to new sight unseen data.

As with validation, we run the above process of configurations and cost calculations, but **not** the optimization function step.

**When things go poorly**

If the system performs worse on the validation dataset, or even the test dataset, what can the AI engineers do?

Lots of things.

These things are called **tuning the hyperparameters**. Let’s explore these below, including the question of which parts of the overall process were fixed by the engineers and those generated or adjusted by the AI.

#### 7. Hyperparameters

These are the AI engineer decided / controlled parts of an AI system. These include:

1. The choice of Model Function, e.g. y = mx + b rather than y = m₁x² + m₂x + b

2. The initial choice of m and b

3. The choice of Learning Rate / adjustments thereto

4. The choice of the Cost Function

5. The choice of the Optimisation Function

6. The amount of Training Data

7. The number and type of Training Data features

Regarding 7, in the worked example above we only use one feature about a thing, x, to predict a related value y. However, returning to the idea that x could be the number of documents on a matter and y the total hours billed to that matter, we could add more features to our model, e.g. the types of documents and their number, the number and seniority of lawyers on the matter, the size of each document, the number of versions for each document etc.

This may improve the model because we assume that the relationship of billable hours to a matter goes beyond the number of docs but aren’t certain which factors impact billables nor how we weigh these up to produce accurate billable hours estimates. If we add them to the dataset the system has more variables to weigh up in its attempts to find the mathematical relationship between all inputs and the desired output billable hours estimate.
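With more than one feature, the model function becomes a weighted sum: each feature gets its own weight, plus a single bias unit. The sketch below assumes two features (number of documents and number of lawyers) and uses made-up weights purely to show the arithmetic; a real system would learn them from data.

```python
def predict(features, weights, b):
    """Weighted sum of several features plus a bias unit."""
    return sum(w * x for w, x in zip(weights, features)) + b

# Hypothetical matter: 10 documents, 2 lawyers.
features = [10, 2]

# Made-up weights and bias for illustration only.
weights = [3.0, 5.0]
b = 1.0

print(predict(features, weights, b))  # 3*10 + 5*2 + 1 = 41.0
```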

It is these components of the AI system with which engineers systematically experiment to determine ways in which to boost the system’s performance, e.g. if the system performs well on the training dataset but less well on the validation set or test set.

#### 8. Parameters

So what does the AI system generate or adjust within itself? The AI system generated parts are:

1. The intermediate and final choices of m (the **weights**) and b (**bias unit**)

2. The predicted y values for a given input x value

As you can see, the AI system updates *some* of its own code, i.e. the parameters in its model function, but a lot of the mechanics are fixed by the AI engineers. A common misconception is that AI systems have few fixed components and more or less write and rewrite the majority of their code. This isn’t strictly true, or not in the sense most non-technical individuals understand it.
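One way to picture the split between the two is as two separate collections of settings. The dictionary layout and key names below are our own illustration, populated with the values from the worked example.

```python
# Fixed by the engineers before training (hyperparameters).
hyperparameters = {
    "model_function": "y = mx + b",
    "initial_m": 0.5,
    "initial_b": 3,
    "learning_rate": 0.25,
}

# Generated and adjusted by the system during training (parameters).
parameters = {
    "m": hyperparameters["initial_m"],
    "b": hyperparameters["initial_b"],
}

# After the first update in the worked example:
parameters["m"] = 0.25
parameters["b"] = 1.5

print(hyperparameters["learning_rate"])  # 0.25, untouched by training
print(parameters)                        # {'m': 0.25, 'b': 1.5}
```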

## Machine Learning One Pager

And here it is summarized into one diagram!

## A legal snack

As you can see, an AI system’s linear regression setup has many different components.

Note that many of the components originate with different sources or processes.

For instance, in our AI system, there is an interesting legal analysis that could be applied regarding who owns what IP in the system regarding its inputs and its outputs.

Without exploring that in legal detail, suppose these questions:

Who owns the input data? If the engineers created it, perhaps them. But if they scraped it (i.e. downloaded it) from the internet what were they allowed to do with that data?

Did the additional steps of data cleaning, feature engineering, data labelling, data wrangling and so on create any new rights in that data or database?

If the engineers sell their system to a third party who adds their own data and re-trains the model to generate better outputs than the original engineers do they own that output, i.e the newly adjusted model function that provides higher performance?

Can we say that the AI generated the m, b and output values and therefore has some agency and ownership of these values, which have potential commercial value? What about the person providing the data, without which the system can’t learn anything?

Good questions to ponder!

## What next? (Further reading)

If you made it this far, **well done**. This was a hard article to write, because simplifying the underlying concepts is tricky, and we hope it was not hard to follow as a result! If you enjoyed it and want to learn more, we recommend you check out the below resources as a next step. In time we will write an article with a full worked + maths + code example of machine learning (linear regression most likely) and how it can be used in law.

Machine Learning for Everyone. This is a well organized roadmap to the world of machine learning, including lots of intuitive real world examples (no code, no math). It’s excellent as a next step in your learning, going into more detail on the other subtypes of machine learning and their variations.

Machine Learning is Fun. It’s an 8 part article series of excellent practical worked examples. Knowledge of some basic Python and a deeper familiarity with maths will help, but isn’t strictly necessary to further your education.