Learning spaCy¶

spaCy is a production-ready NLP library for Python.

Install¶

In [1]:
pip install -U pip setuptools wheel
Requirement already satisfied: pip in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (25.3)
Requirement already satisfied: setuptools in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (80.9.0)
Requirement already satisfied: wheel in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (0.45.1)
Note: you may need to restart the kernel to use updated packages.
In [2]:
pip install -U spacy
Requirement already satisfied: spacy in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (3.8.11)
Note: you may need to restart the kernel to use updated packages.
In [95]:
!python -m spacy download en_core_web_md
Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.5/33.5 MB 80.0 MB/s 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
In [96]:
import spacy

Named Entity Recognition¶

Named entity recognition (NER) labels spans of text that refer to real-world objects — people, organizations, locations, monetary values, and so on.

In [123]:
ner_example_1 = "Apple is looking at buying U.K. startup for $1 billion"
ner_example_2 = "MS Now Justice and Intelligence Correspondent Ken Dilanian wrote, 'The suspect has been charged with placing the bombs, which did not detonate. The allegations, if proven, would end a longstanding mystery that sparked a multitude of conspiracy theories over who planted the pipe bombs before a mob of pro-Trump supporters stormed the Capitol aiming to stop Joe Biden from being installed as president. Authorities have not yet determined a motive, a law enforcement official said. But the suspect has been linked to statements in support of anarchist ideology, said two people briefed on the arrest.'"
# nlp = spacy.load("en_core_web_sm") # smaller model, does not include vectors
# nlp = spacy.load("en_core_web_lg") # large model, recommended
nlp = spacy.load("en_core_web_md")
In [98]:
doc1 = nlp(ner_example_1)
for ent in doc1.ents:
    print(ent.text, ent.label_)
Apple ORG
U.K. GPE
$1 billion MONEY
In [99]:
doc2 = nlp(ner_example_2)
for ent in doc2.ents:
    print(ent.text, ent.label_)
MS Now Justice ORG
Intelligence ORG
Ken Dilanian PERSON
Capitol ORG
Joe Biden PERSON
two CARDINAL
In [100]:
type(doc2.ents[0])
Out[100]:
spacy.tokens.span.Span

Part-of-speech tagging¶

In [101]:
for token in doc2:
    print(token.text, token.pos_)
MS PROPN
Now PROPN
Justice PROPN
and CCONJ
Intelligence PROPN
Correspondent PROPN
Ken PROPN
Dilanian PROPN
wrote VERB
, PUNCT
' PUNCT
The DET
suspect NOUN
has AUX
been AUX
charged VERB
with ADP
placing VERB
the DET
bombs NOUN
, PUNCT
which PRON
did AUX
not PART
detonate VERB
. PUNCT
The DET
allegations NOUN
, PUNCT
if SCONJ
proven VERB
, PUNCT
would AUX
end VERB
a DET
longstanding ADJ
mystery NOUN
that PRON
sparked VERB
a DET
multitude NOUN
of ADP
conspiracy NOUN
theories NOUN
over ADP
who PRON
planted VERB
the DET
pipe NOUN
bombs NOUN
before ADP
a DET
mob NOUN
of ADP
pro ADJ
- ADJ
Trump ADJ
supporters NOUN
stormed VERB
the DET
Capitol PROPN
aiming VERB
to PART
stop VERB
Joe PROPN
Biden PROPN
from ADP
being AUX
installed VERB
as ADP
president NOUN
. PUNCT
Authorities NOUN
have AUX
not PART
yet ADV
determined VERB
a DET
motive NOUN
, PUNCT
a DET
law NOUN
enforcement NOUN
official NOUN
said VERB
. PUNCT
But CCONJ
the DET
suspect NOUN
has AUX
been AUX
linked VERB
to ADP
statements NOUN
in ADP
support NOUN
of ADP
anarchist ADJ
ideology NOUN
, PUNCT
said VERB
two NUM
people NOUN
briefed VERB
on ADP
the DET
arrest NOUN
. PUNCT
' PUNCT
In [102]:
# verbs
verbs = [token.text for token in doc2 if token.pos_ == "VERB"]
print(verbs)
['wrote', 'charged', 'placing', 'detonate', 'proven', 'end', 'sparked', 'planted', 'stormed', 'aiming', 'stop', 'installed', 'determined', 'said', 'linked', 'said', 'briefed']
In [103]:
# nouns
nouns = [token.text for token in doc2 if token.pos_ == "NOUN"]
print(nouns)
['suspect', 'bombs', 'allegations', 'mystery', 'multitude', 'conspiracy', 'theories', 'pipe', 'bombs', 'mob', 'supporters', 'president', 'Authorities', 'motive', 'law', 'enforcement', 'official', 'suspect', 'statements', 'support', 'ideology', 'people', 'arrest']
In [125]:
# noun chunks
for chunk in doc2.noun_chunks:
    print(chunk.text, "-", chunk.root.text)
MS Now Justice and Intelligence Correspondent Ken Dilanian - Dilanian
The suspect - suspect
the bombs - bombs
which - which
The allegations - allegations
a longstanding mystery - mystery
that - that
a multitude - multitude
conspiracy theories - theories
who - who
the pipe bombs - bombs
a mob - mob
pro-Trump supporters - supporters
the Capitol - Capitol
Joe Biden - Biden
president - president
Authorities - Authorities
a motive - motive
a law enforcement official - official
the suspect - suspect
statements - statements
support - support
anarchist ideology - ideology
two people - people
the arrest - arrest

Visualization¶

In [105]:
from spacy import displacy

displacy.render(doc2, style="ent")
# use displacy.serve outside of jupyter notebook
(renders doc2 with highlighted entity spans: MS Now Justice ORG, Intelligence ORG, Ken Dilanian PERSON, the Capitol ORG, Joe Biden PERSON, two CARDINAL)
In [106]:
displacy.render(doc1, style="dep")
(renders a dependency-parse diagram of doc1, "Apple is looking at buying U.K. startup for $1 billion", with POS tags per token and arcs labeled nsubj, aux, prep, pcomp, compound, dobj, prep, quantmod, compound, pobj)

Sentence segmentation¶

In [107]:
for sentence in doc2.sents:
    print(sentence)
    print("-----")
MS Now Justice and Intelligence Correspondent Ken Dilanian wrote, 'The suspect has been charged with placing the bombs, which did not detonate.
-----
The allegations, if proven, would end a longstanding mystery that sparked a multitude of conspiracy theories over who planted the pipe bombs before a mob of pro-Trump supporters stormed the Capitol aiming to stop Joe Biden from being installed as president.
-----
Authorities have not yet determined a motive, a law enforcement official said.
-----
But the suspect has been linked to statements in support of anarchist ideology, said two people briefed on the arrest.'
-----

Similarity and Vectors¶

In [121]:
# doc similarity
similarity = doc1.similarity(doc2)
print(similarity)
0.6650668978691101
In [122]:
# token similarity
token = doc1[0]
print(token.text)
print(token.vector)
print(token.similarity(token))
Apple
[-0.6334     0.18981   -0.53544   -0.52658   -0.30001    0.30559
 -0.49303    0.14636    0.012273   0.96802    0.0040354  0.25234
 ... (300-dimensional vector, output truncated) ]
1.0
In [126]:
doc3 = nlp("The boy is running a race.")
doc4 = nlp("The boy is walking a race.")
doc3.similarity(doc4)
Out[126]:
0.9790079593658447
In [128]:
doc5 = nlp("The election was stolen.")
doc6 = nlp("The election was not stolen.")
doc5.similarity(doc6)
Out[128]:
0.9854341745376587
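The high scores above are a consequence of how Doc.similarity works: by default it is the cosine similarity of the Docs' averaged token vectors, so flipping one word ("not") barely moves the average — which is why "stolen" and "not stolen" score nearly identically. A minimal stdlib sketch of cosine similarity on toy 3-dimensional vectors (illustrative numbers, not real spaCy embeddings):

```python
import math

def cosine_similarity(a, b):
    # cosine(a, b) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "doc vectors": averaging token vectors washes out small
# word-level differences, so near-identical sentences score near 1.0.
doc_a = [0.5, 0.1, -0.3]
doc_b = [0.5, 0.2, -0.3]
print(cosine_similarity(doc_a, doc_a))  # identical vectors -> ~1.0
print(cosine_similarity(doc_a, doc_b))
```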

spaCy Pipelines¶

Calling nlp() on a text produces a Doc. The sequence of components that process the Doc (tagger, parser, NER, etc.) is defined by the components section of the pipeline's config file; read more here
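Conceptually, a pipeline is an ordered sequence of components, each of which takes the document and returns it with more annotations. A plain-Python sketch of that idea (a toy dict-based "doc", not spaCy's actual Language/Doc classes):

```python
# Toy pipeline: each component receives the "doc" (here a plain dict)
# and returns it, possibly annotated -- mirroring how each spaCy
# component takes a Doc and returns a Doc.

def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def count_tokens(doc):
    doc["n_tokens"] = len(doc["tokens"])
    return doc

pipeline = [tokenize, count_tokens]  # order matters, like nlp.pipe_names

def run(text):
    doc = {"text": text}
    for component in pipeline:
        doc = component(doc)
    return doc

doc = run("spaCy pipelines are ordered components")
print(doc["n_tokens"])  # 5
```

In spaCy itself you can inspect the configured components with `nlp.pipe_names`.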

You can browse community-built pipelines here

Propaganda detection pipeline: https://github.com/AkashSDE/PropagandaDetectionNLP, and a bigger project

Entity Linking¶

Entity linking is the practice of resolving identified entities against external datasets (knowledge bases), which you can then use to build visual graphs of related entities.
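A toy illustration of the idea, using a hypothetical hand-built knowledge base with exact string lookup (real linkers, such as spaCy's EntityLinker component, disambiguate mentions in context and resolve them to knowledge-base IDs):

```python
# Hypothetical mini knowledge base mapping entity mentions to IDs
# (Wikidata-style QIDs, shown for illustration only). A real entity
# linker must disambiguate -- "Apple" the company vs. the fruit --
# using context; this sketch only does exact string lookup.
KB = {
    "Joe Biden": "Q6279",
    "Apple": "Q312",
    "U.K.": "Q145",
}

def link_entities(entities):
    """Map recognized entity strings to KB IDs, or None if unknown."""
    return {ent: KB.get(ent) for ent in entities}

links = link_entities(["Apple", "Joe Biden", "Ken Dilanian"])
print(links)
```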

Detailed video here

NLP and more learning¶

https://applied-language-technology.mooc.fi/html/index.html
