A.I. systems, including legal ones, typically use a form of artificial intelligence known as machine learning (sometimes combined with rules-based and search techniques). For the machine learning elements, a distinction is drawn between supervised learning and unsupervised learning.
We’ll explain:
- what each of these means;
- how they work, plus an example of each in a legal context;
- when to use each, and whether supervised or unsupervised learning is better; and
- the misconception that “out of the box” means unsupervised learning.
Supervised Learning
learning from labels
Supervised learning requires labelled data. That data is typically labelled by a domain expert, i.e. someone skilled at identifying which labels go with which data. In the legal context, this will be a lawyer or legally trained individual.
In the consumer space, this is often you! For instance, Facebook is great at automatically tagging your friends in photos.
Why is that? It is because of the historical training you provided – and continue to provide – when manually tagging photos of your friends. Over time, with more examples of your friends in different conditions (lighting, angles and obscuring detail), Facebook’s algorithms learn how to tag photo A as “Arnold” and photo B as “Linda”.
Legal A.I. systems identifying and extracting clauses (or intra-clause data, e.g. a financial number such as rent amount) also achieve this via supervised learning.
For example, a legal A.I. due diligence tool may extract governing law clauses from SPAs (share purchase agreements). To do so, either the vendor or the user provides the system with labelled examples of governing law clauses.
This process is known as training: a supervised machine learning algorithm uses the labelled examples to generate a predictive model.
A predictive model is a mathematical formula that maps a given input to the desired output, in this case a predicted classification, i.e. the correct governing law. The model is predictive because it relies on statistical and probabilistic techniques to predict the correct governing law based on historical data.
A basic workflow describing the above process for the governing law example is shown below:

The above generates a predictive model mathematically optimised to predict whether a given combination of words is more or less likely to belong to a particular label.
In machine learning terms this type of supervised learning is known as classification, because we are building a system to classify an input into one of two or more classes (here, governing laws).

Accurate though it might become, the model understands neither the labels nor what it is labelling. As we always like to stress at lawtomated, machine learning is maths not minds.
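To make the training and prediction steps concrete, here is a minimal, hypothetical sketch in Python using scikit-learn. The clause texts, labels and choice of model are our own illustrative assumptions, not any vendor’s actual implementation; real systems train on far larger datasets with more sophisticated features, but the mechanics are the same.

```python
# Illustrative sketch only: a toy supervised "governing law" classifier.
# The example clauses, labels and model choice are assumptions for demonstration,
# not any vendor's actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labelled training data: clause text -> governing law label, supplied by a domain expert.
clauses = [
    "This Agreement shall be governed by the laws of England and Wales.",
    "This Agreement is governed by and construed in accordance with English law.",
    "This Agreement shall be governed by the laws of the State of New York.",
    "The laws of the State of New York shall govern this Agreement.",
]
labels = ["England & Wales", "England & Wales", "New York", "New York"]

# Training: a supervised algorithm turns the labelled examples into a predictive model,
# i.e. a mathematical mapping from clause text to a predicted label.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(clauses, labels)

# Prediction: the model assigns a label to a new, unseen clause.
new_clause = "This deed shall be governed by English law."
print(model.predict([new_clause]))  # a statistical prediction, not understanding
```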
If you are interested in digging deeper, check out our forthcoming guide to training, testing and cross-validation of machine learning systems, each of which is a fundamental concept in any machine learning system, albeit usually abstracted away or unavailable to users via the UI of legal A.I. systems.
Unsupervised Learning
Pattern spotting
Unlike supervised learning, unsupervised learning does not require labelled data. This is because unsupervised learning techniques serve a different purpose: they are designed to identify patterns inherent in the structure of the data.
A typical non-legal use case is a technique called clustering. This is used to segment customers into groups by distinct characteristics (e.g. age group) to better target marketing campaigns and product recommendations, or to prevent churn.
A common legal use case for this technique, A.I. powered contract due diligence, is diagrammed below:

As the above illustrates, we start with a disorganised bag of governing law clauses. An unsupervised technique such as clustering can be used to identify statistical patterns inherent in the data, grouping similar governing law clause formulations together and keeping them separate from dissimilar items.
In this example, the data scientist – or in some cases the end user to the extent such controls are exposed via a UI – can adjust the similarity threshold, typically a value between 0 and 1.
If set to 1 the algorithm will cluster together only identical items, i.e. identifying duplicates. This turns data – random clauses – into information we can use, i.e. we now understand the dataset contains duplicate data, which in turn may be a valuable insight.
If set to 0 the algorithm will group everything into one cluster, because even entirely distinct items meet a similarity threshold of zero.
A setting between 0 and 1 will cluster the data into varying cluster sizes and groupings. To be clear, a setting of 0.8 would cluster together clauses that are at least 80% similar. Users might use this to detect near duplicates, i.e. documents that are virtually but not entirely identical.
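Here is a minimal, hypothetical sketch of this idea in Python (scikit-learn), assuming a handful of invented clause texts and a 0.8 similarity threshold; real tools operate on thousands of documents and richer representations, but the key point is the same: no labels are supplied, only a measure of similarity. (Newer versions of scikit-learn call the distance parameter `metric`; older ones used `affinity`.)

```python
# Illustrative sketch only: clustering unlabelled clauses by textual similarity.
# The clause texts and the 0.8 threshold are assumptions for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

clauses = [
    "This Agreement shall be governed by the laws of England and Wales.",
    "This Agreement shall be governed by the laws of England and Wales.",  # exact duplicate
    "This Agreement is governed by and construed in accordance with English law.",
    "This Agreement shall be governed by the laws of the State of New York.",
]

# No labels: the algorithm only sees the statistical structure of the text.
vectors = TfidfVectorizer().fit_transform(clauses).toarray()

similarity_threshold = 0.8  # cluster together items that are at least ~80% similar
clusterer = AgglomerativeClustering(
    n_clusters=None,
    metric="cosine",
    linkage="average",
    distance_threshold=1 - similarity_threshold,  # cosine distance = 1 - similarity
)
cluster_ids = clusterer.fit_predict(vectors)

for cluster_id, clause in zip(cluster_ids, clauses):
    print(cluster_id, clause[:60])  # duplicates / near-duplicates share a cluster id
```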
Which is better: supervised or unsupervised?
(Hint: You’re asking the wrong question)
Here’s a helpful analogy for the supervised learning vs unsupervised learning question.
Ask yourself: which is better, screwdriver or hammer?

The answer is neither. They serve different purposes, albeit they sometimes work hand in hand (literally) to achieve a bigger outcome, e.g. a set of shelves.
In the same way, when people ask the question – “Which is better, supervised or unsupervised learning?” – the answer is neither, albeit the two are often combined to achieve an end result.
For example, unsupervised learning is sometimes used to automatically preprocess data into logical groupings based on the distribution of the data, as in the clause clustering example above. This might result in groupings based on the type of paperwork used for a contract type, e.g. all the contracts stemming from template A fall into one cluster while those stemming from template B fall into another. This turns data into useful information to the extent it was not previously known, nor immediately identifiable, by a human reviewer.
This may, in turn, assist human domain experts with their dataset labelling, e.g. by identifying which documents will most likely contain representative examples of the data points they wish to label at a more granular level and those which won’t. The subsequent labelling will then feed into a supervised learning algorithm that produces the final result, e.g. a due diligence report summary of red flag clauses in an M&A data room.
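As a rough sketch of how the two steps fit together, reusing the hypothetical `clauses` and `cluster_ids` from the clustering example above, the clusters can be used to surface one representative clause per grouping for a lawyer to label, with the resulting labels then feeding a supervised classifier like the one sketched earlier. This is an illustrative assumption about workflow, not a description of any particular product.

```python
# Illustrative sketch only: using unsupervised clusters to focus supervised labelling.
# Assumes `clauses` and `cluster_ids` from the clustering sketch above; the labelling
# itself is still performed by a human domain expert.
from collections import defaultdict

# Group the unlabelled clauses by the clusters the unsupervised step found.
by_cluster = defaultdict(list)
for clause, cluster_id in zip(clauses, cluster_ids):
    by_cluster[cluster_id].append(clause)

# Surface one representative clause per cluster for the lawyer to label,
# rather than asking them to wade through the whole disorganised pile.
for cluster_id, members in sorted(by_cluster.items()):
    print(f"Cluster {cluster_id} ({len(members)} clauses) - please label:")
    print("  ", members[0][:80])

# The resulting (clause, label) pairs would then train a supervised model,
# e.g. the governing law classifier sketched earlier.
```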

Out of the box vs. Unsupervised Learning
good vendors distinguish, bad vendors disguise
Any legal team buying an A.I. system will want to know which is best for them. Vendors in the crowded A.I. contract due diligence space typically provide one or both of two features:
- OOTB (“out of the box”) Extractors: product features pre-trained by the vendor to identify and extract popular contract provisions, e.g. governing law, termination, indemnity etc.
- Self-trained extractors: product features the user trains to generate a user-specific predictive model for a contract provision of their choosing and design.
In either case, someone has to train the system with labelled data. This is because both techniques are supervised learning techniques of the sort described above.
Unfortunately, some vendors, deliberately or by omission, lead people (media, buyers and users) to believe that because something comes ready and working “out of the box” it must use unsupervised learning.
This is patently false: it will have been trained by the vendor if it is performing a classification task such as extracting clauses from contracts.
By extension, conflating OOTB Extractors with unsupervised learning is usually intended to suggest the vendor’s solution is superior to products without such features, i.e. because it “requires no training” or, worse, because the system “just learns by itself”. Again, this is inaccurate and misleading.
OOTB Extractors vs. Self-trained Extractors
Another bake-off!
Flowing from the above: just as with the earlier question of whether supervised or unsupervised learning is better, the question of OOTB Extractors vs. Self-trained Extractors is the wrong one to ask.
Recall that both are supervised learning techniques. The differences, however, are these:
| | OOTB Extractors | Self-trained Extractors |
| --- | --- | --- |
| Who | Vendor trained | User trained |
| What | Public data, e.g. filings at the SEC, Companies House, etc. | User’s data, e.g. from the document management system (“DMS”), but also public data to the extent users curate a dataset from public sources |
| How | Good vendors actively disclose this in some detail. It usually involves a senior lawyer deciding on the initial labelling methodology and examples, which is then replicated by junior lawyers / law students across a wider dataset and iterated alongside the vendor’s technical team to gradually improve performance. Bad vendors will not disclose this process in any detail. Understanding this is vital as a buyer: you are trusting someone else, their methodology and their dataset – are you sure the who, what and how meet your needs and quality controls? | Depends on the application and the user’s own methodology. In theory, it should mirror what the vendor has done (to the extent the vendor has invested effort in training and selling its own provisions). Unfortunately, many self-training features are not well workflowed, explained or robust enough to be managed in the way knowledge lawyers expect to create and curate knowledge (e.g. versioning, access and editing permissions etc.). |
| Pros | Ready to use out of the box | Bespoke to the user’s needs, not the market’s |
| Cons | Trained on public data, which may be biased toward certain languages, jurisdictions and / or document types. For instance, many vendors use the SEC filing system in the USA and UK Companies House, both of which bias toward English-language documents with a UK or US centric focus and, in the SEC’s case, only certain types of companies and documents. The vendor-side trainers and their methodology may also be less experienced than the user’s, either in general or with regard to the specific domain challenges of the user’s practice area or business need. Often these provisions are locked, in the sense that the user cannot “top up” the provision with additional training, either to improve its accuracy in general or to tailor it toward a specific variation on a data point, e.g. tuning an OOTB Extractor trained on lease assignment clauses to work with leveraged finance agreement assignment clauses (which are quite different!). | Requires training: both training the users in how to train the system, and the trained users then training the system itself. Self-training features are usually underdeveloped and do not provide the controls a knowledge management lawyer might expect, e.g. version control and access / read / write permissions for model curation and sharing. Nor do such tools provide a UI with the degree of finesse available to a data scientist for tuning the training, test and cross-validation datasets. This can be a blocker to uptake and adoption of the product and / or its self-training features, i.e. because it is hard to properly tune models, let alone version them to protect their integrity. |
Conclusion
Hopefully, you’ve learnt:
- What supervised learning is.
- What unsupervised learning is.
- How each of the above works (at a high level).
- A basic use case example of supervised learning vs unsupervised learning.
- The key difference for most legal use cases: that supervised learning requires labelled data to predict labels for new data objects whereas unsupervised learning does not require labels and instead mathematically infers groupings.
- That neither supervised learning nor unsupervised learning is objectively better; each serves different purposes, albeit can be (and often are) used in combination to achieve a larger goal.
- That unsupervised learning and OOTB pre-trained extractors are not the same, that the latter is, in fact, supervised learning (albeit trained by the vendor) and doesn’t simply “learn by itself”!
- The who, what, how, pros and cons of OOTB pre-trained extractors vs. self-trained extractors.
If you want to learn more about artificial intelligence, check out this article. If you’re interested in the differences between machine learning and deep learning, head over here.