In [1]:
pip install -U pip setuptools wheel
Requirement already satisfied: pip in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (25.3) Requirement already satisfied: setuptools in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (80.9.0) Requirement already satisfied: wheel in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (0.45.1) Note: you may need to restart the kernel to use updated packages.
In [2]:
pip install -U spacy
Requirement already satisfied: spacy in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (3.8.11) Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (3.0.12) Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (1.0.5) Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (1.0.15) Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (2.0.13) Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (3.0.12) Requirement already satisfied: thinc<8.4.0,>=8.3.4 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (8.3.10) Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (1.1.3) Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (2.5.2) Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (2.0.10) Requirement already satisfied: weasel<0.5.0,>=0.4.2 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (0.4.3) Requirement already satisfied: typer-slim<1.0.0,>=0.3.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (0.20.0) Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (4.67.1) Requirement already satisfied: numpy>=1.19.0 in 
/Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (2.2.6) Requirement already satisfied: requests<3.0.0,>=2.13.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (2.32.5) Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (2.12.5) Requirement already satisfied: jinja2 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (3.1.6) Requirement already satisfied: setuptools in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (80.9.0) Requirement already satisfied: packaging>=20.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from spacy) (25.0) Requirement already satisfied: annotated-types>=0.6.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (0.7.0) Requirement already satisfied: pydantic-core==2.41.5 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (2.41.5) Requirement already satisfied: typing-extensions>=4.14.1 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (4.15.0) Requirement already satisfied: typing-inspection>=0.4.2 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy) (0.4.2) Requirement already satisfied: charset_normalizer<4,>=2 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from requests<3.0.0,>=2.13.0->spacy) (3.4.4) Requirement already satisfied: idna<4,>=2.5 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from requests<3.0.0,>=2.13.0->spacy) (3.11) Requirement already satisfied: urllib3<3,>=1.21.1 in 
/Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.5.0) Requirement already satisfied: certifi>=2017.4.17 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from requests<3.0.0,>=2.13.0->spacy) (2025.11.12) Requirement already satisfied: blis<1.4.0,>=1.3.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from thinc<8.4.0,>=8.3.4->spacy) (1.3.3) Requirement already satisfied: confection<1.0.0,>=0.0.1 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from thinc<8.4.0,>=8.3.4->spacy) (0.1.5) Requirement already satisfied: click>=8.0.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from typer-slim<1.0.0,>=0.3.0->spacy) (8.3.1) Requirement already satisfied: cloudpathlib<1.0.0,>=0.7.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from weasel<0.5.0,>=0.4.2->spacy) (0.23.0) Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from weasel<0.5.0,>=0.4.2->spacy) (7.5.0) Requirement already satisfied: wrapt in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.4.2->spacy) (2.0.1) Requirement already satisfied: MarkupSafe>=2.0 in /Users/samueldowds/.pyenv/versions/3.10.19/lib/python3.10/site-packages (from jinja2->spacy) (3.0.3) Note: you may need to restart the kernel to use updated packages.
In [95]:
!python -m spacy download en_core_web_md
Collecting en-core-web-md==3.8.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.5/33.5 MB 80.0 MB/s 0:00:00m0:00:0100:01
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
In [96]:
import spacy
Named Entity Recognition
A named entity is a real-world object, such as a person, organization, or place, that is assigned a label.
In [123]:
ner_example_1 = "Apple is looking at buying U.K. startup for $1 billion"
ner_example_2 = "MS Now Justice and Intelligence Correspondent Ken Dilanian wrote, 'The suspect has been charged with placing the bombs, which did not detonate. The allegations, if proven, would end a longstanding mystery that sparked a multitude of conspiracy theories over who planted the pipe bombs before a mob of pro-Trump supporters stormed the Capitol aiming to stop Joe Biden from being installed as president. Authorities have not yet determined a motive, a law enforcement official said. But the suspect has been linked to statements in support of anarchist ideology, said two people briefed on the arrest.'"
# nlp = spacy.load("en_core_web_sm") # smaller model, does not include vectors
# nlp = spacy.load("en_core_web_lg") # large model, recommended
nlp = spacy.load("en_core_web_md")
In [98]:
doc1 = nlp(ner_example_1)
for ent in doc1.ents:
    print(ent.text, ent.label_)
In [99]:
doc2 = nlp(ner_example_2)
for ent in doc2.ents:
    print(ent.text, ent.label_)
MS Now Justice ORG
Intelligence ORG
Ken Dilanian PERSON
Capitol ORG
Joe Biden PERSON
two CARDINAL
In [100]:
type(doc2.ents[0])
Out[100]:
spacy.tokens.span.Span
Part-of-speech tagging
In [101]:
for token in doc2:
    print(token.text, token.pos_)
MS PROPN Now PROPN Justice PROPN and CCONJ Intelligence PROPN Correspondent PROPN Ken PROPN Dilanian PROPN wrote VERB , PUNCT ' PUNCT The DET suspect NOUN has AUX been AUX charged VERB with ADP placing VERB the DET bombs NOUN , PUNCT which PRON did AUX not PART detonate VERB . PUNCT The DET allegations NOUN , PUNCT if SCONJ proven VERB , PUNCT would AUX end VERB a DET longstanding ADJ mystery NOUN that PRON sparked VERB a DET multitude NOUN of ADP conspiracy NOUN theories NOUN over ADP who PRON planted VERB the DET pipe NOUN bombs NOUN before ADP a DET mob NOUN of ADP pro ADJ - ADJ Trump ADJ supporters NOUN stormed VERB the DET Capitol PROPN aiming VERB to PART stop VERB Joe PROPN Biden PROPN from ADP being AUX installed VERB as ADP president NOUN . PUNCT Authorities NOUN have AUX not PART yet ADV determined VERB a DET motive NOUN , PUNCT a DET law NOUN enforcement NOUN official NOUN said VERB . PUNCT But CCONJ the DET suspect NOUN has AUX been AUX linked VERB to ADP statements NOUN in ADP support NOUN of ADP anarchist ADJ ideology NOUN , PUNCT said VERB two NUM people NOUN briefed VERB on ADP the DET arrest NOUN . PUNCT ' PUNCT
In [102]:
# verbs
verbs = [token.text for token in doc2 if token.pos_ == "VERB"]
print(verbs)
['wrote', 'charged', 'placing', 'detonate', 'proven', 'end', 'sparked', 'planted', 'stormed', 'aiming', 'stop', 'installed', 'determined', 'said', 'linked', 'said', 'briefed']
In [103]:
# nouns
nouns = [token.text for token in doc2 if token.pos_ == "NOUN"]
print(nouns)
['suspect', 'bombs', 'allegations', 'mystery', 'multitude', 'conspiracy', 'theories', 'pipe', 'bombs', 'mob', 'supporters', 'president', 'Authorities', 'motive', 'law', 'enforcement', 'official', 'suspect', 'statements', 'support', 'ideology', 'people', 'arrest']
In [125]:
# noun chunks
for chunk in doc2.noun_chunks:
    print(chunk.text, "-", chunk.root.text)
MS Now Justice and Intelligence Correspondent Ken Dilanian - Dilanian
The suspect - suspect
the bombs - bombs
which - which
The allegations - allegations
a longstanding mystery - mystery
that - that
a multitude - multitude
conspiracy theories - theories
who - who
the pipe bombs - bombs
a mob - mob
pro-Trump supporters - supporters
the Capitol - Capitol
Joe Biden - Biden
president - president
Authorities - Authorities
a motive - motive
a law enforcement official - official
the suspect - suspect
statements - statements
support - support
anarchist ideology - ideology
two people - people
the arrest - arrest
Visualization
In [105]:
from spacy import displacy
displacy.render(doc2, style="ent")
# use displacy.serve outside of jupyter notebook
[MS Now Justice ORG] and [Intelligence ORG] Correspondent [Ken Dilanian PERSON] wrote, 'The suspect has been charged with placing the bombs, which did not detonate. The allegations, if proven, would end a longstanding mystery that sparked a multitude of conspiracy theories over who planted the pipe bombs before a mob of pro-Trump supporters stormed the [Capitol ORG] aiming to stop [Joe Biden PERSON] from being installed as president. Authorities have not yet determined a motive, a law enforcement official said. But the suspect has been linked to statements in support of anarchist ideology, said [two CARDINAL] people briefed on the arrest.'
In [106]:
displacy.render(doc1, style="dep")
Sentence segmentation
In [107]:
for sentence in doc2.sents:
    print(sentence)
    print("-----")
MS Now Justice and Intelligence Correspondent Ken Dilanian wrote, 'The suspect has been charged with placing the bombs, which did not detonate.
-----
The allegations, if proven, would end a longstanding mystery that sparked a multitude of conspiracy theories over who planted the pipe bombs before a mob of pro-Trump supporters stormed the Capitol aiming to stop Joe Biden from being installed as president.
-----
Authorities have not yet determined a motive, a law enforcement official said.
-----
But the suspect has been linked to statements in support of anarchist ideology, said two people briefed on the arrest.'
-----
Similarity and Vectors
In [121]:
# doc similarity
similarity = doc1.similarity(doc2)
print(similarity)
0.6650668978691101
In [122]:
# token similarity
token = doc1[0]
print(token.text)
print(token.vector)
print(token.similarity(token))
Apple [-0.6334 0.18981 -0.53544 -0.52658 -0.30001 0.30559 -0.49303 0.14636 0.012273 0.96802 0.0040354 0.25234 -0.29864 -0.014646 -0.24905 -0.67125 -0.053366 0.59426 -0.068034 0.10315 0.66759 0.024617 -0.37548 0.52557 0.054449 -0.36748 -0.28013 0.090898 -0.025687 -0.5947 -0.24269 0.28603 0.686 0.29737 0.30422 0.69032 0.042784 0.023701 -0.57165 0.70581 -0.20813 -0.03204 -0.12494 -0.42933 0.31271 0.30352 0.09421 -0.15493 0.071356 0.15022 -0.41792 0.066394 -0.034546 -0.45772 0.57177 -0.82755 -0.27885 0.71801 -0.12425 0.18551 0.41342 -0.53997 0.55864 -0.015805 -0.1074 -0.29981 -0.17271 0.27066 0.043996 0.60107 -0.353 0.6831 0.20703 0.12068 0.24852 -0.15605 0.25812 0.007004 -0.10741 -0.097053 0.085628 0.096307 0.20857 -0.23338 -0.077905 -0.030906 1.0494 0.55368 -0.10703 0.052234 0.43407 -0.13926 0.38115 0.021104 -0.40922 0.35972 -0.28898 0.30618 0.060807 -0.023517 0.58193 -0.3098 0.21013 -0.15557 -0.56913 -1.1364 0.36598 -0.032666 1.1926 0.12825 -0.090486 -0.47965 -0.61164 -0.16484 -0.41134 0.19925 0.059183 -0.20842 0.45223 0.27697 -0.20745 0.025404 -0.28874 0.040478 -0.22275 -0.43323 0.76957 -0.054327 -0.35213 -0.30842 -0.48791 -0.35564 0.19813 -0.094767 -0.50918 0.18763 -0.087555 0.37709 -0.1322 -0.096913 -1.9102 0.55813 0.27391 -0.077744 -0.43933 -0.10367 -0.24408 0.41869 0.11659 0.27454 0.81021 -0.11006 0.43131 0.29095 -0.49548 -0.31958 -0.072506 0.020286 0.2179 0.22032 -0.29212 0.75639 0.13598 0.019736 -0.83104 0.22836 -0.28669 -1.0529 0.052771 0.41266 0.50149 0.5323 0.51573 -0.31806 -0.4619 0.21739 -0.43584 -0.41382 0.042237 -0.57179 0.067623 -0.27854 0.090044 0.20633 0.024678 -0.57703 -0.020183 -0.53147 -0.37548 -0.12795 -0.093662 -0.0061183 0.20221 -0.62296 -0.29746 0.26935 0.59009 -0.50382 -0.69757 0.20157 -0.33592 -0.45766 0.14061 0.22982 0.044046 0.26386 0.02942 0.34095 1.1496 -0.15555 -0.064071 0.30139 0.024211 -0.63515 -0.73347 -0.10346 -0.22637 -0.056392 -0.16735 -0.097331 -0.19206 -0.18866 0.15116 -0.038048 0.70205 0.11586 -0.14813 0.0095166 -0.33804 
-0.10158 -0.23829 -0.22759 0.092504 -0.29839 -0.39721 0.26092 0.34594 -0.47396 -0.25725 -0.19257 -0.53071 0.1692 -0.47252 -0.17333 -0.40505 0.046446 -0.04473 0.33555 -0.5693 0.31591 -0.21167 -0.31298 -0.45923 -0.083091 0.086822 0.01264 0.43779 0.12651 0.30156 0.022061 0.26549 -0.29455 -0.14838 0.033692 -0.37346 -0.075343 -0.56498 -0.24207 -0.69351 -0.20277 -0.0081185 0.030971 0.53615 -0.16613 -0.84087 0.74661 0.029132 0.46936 -0.49755 0.40954 -0.022558 0.21497 -0.049528 -0.039799 0.46165 0.26456 0.32985 -0.04219 -0.099599 -0.17312 -0.476 -0.019048 -0.41888 -0.2685 -0.65281 0.068773 -0.23881 -1.1784 0.25504 0.61171 ] 1.0
In [126]:
doc3 = nlp("The boy is running a race.")
doc4 = nlp("The boy is walking a race.")
doc3.similarity(doc4)
Out[126]:
0.9790079593658447
In [128]:
doc5 = nlp("The election was stolen.")
doc6 = nlp("The election was not stolen.")
doc5.similarity(doc6)
Out[128]:
0.9854341745376587
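The two scores above are high even though the sentences disagree. By default, spaCy computes Doc.similarity as the cosine similarity of the two document vectors, and a document vector defaults to the average of its token vectors, so one extra word such as "not" barely moves the average. The sketch below reproduces that effect in plain Python. The 3-dimensional vectors in `toy_vectors` are invented for illustration (en_core_web_md uses 300 dimensions), and `mean_vector` and `cosine` are hypothetical helper names, not spaCy functions.

```python
import math

# Toy word vectors, made up for illustration only.
toy_vectors = {
    "the": [0.1, 0.0, 0.2],
    "election": [0.9, 0.4, 0.1],
    "was": [0.0, 0.1, 0.1],
    "stolen": [0.3, 0.8, 0.5],
    "not": [0.2, 0.1, 0.6],
}


def mean_vector(words):
    """Average the word vectors component by component."""
    vectors = [toy_vectors[word] for word in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sent_a = ["the", "election", "was", "stolen"]
sent_b = ["the", "election", "was", "not", "stolen"]

# Adding "not" shifts the average only slightly, so the score stays high.
score = cosine(mean_vector(sent_a), mean_vector(sent_b))
print(score)

# Averaging also ignores word order: a reversed sentence scores the same as the original.
order_score = cosine(mean_vector(sent_a), mean_vector(list(reversed(sent_a))))
print(order_score)
```

This kind of similarity measures topical overlap rather than agreement, which is why negation and word choice ("running" versus "walking") have so little effect.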
spaCy Pipelines
Calling nlp() on a text produces a Doc. You can define the pipeline of operations applied to that Doc by populating the components section of a config file (read more here).
You can view community-built pipelines here.
Propaganda pipeline: https://github.com/AkashSDE/PropagandaDetectionNLP, and a bigger project
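Pipelines can also be assembled in code rather than in a config file. The sketch below is one minimal way to do it, assuming spaCy 3.x is installed. It starts from spacy.blank("en"), which needs no model download, adds the built-in sentencizer, and then adds a custom component. The component name `token_counter` and the `n_tokens` key are hypothetical, chosen only for this example.

```python
import spacy
from spacy.language import Language


# Register a custom component under a hypothetical name.
@Language.component("token_counter")
def token_counter(doc):
    # A component receives a Doc, may inspect or modify it, and must return it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


# A blank English pipeline contains only the tokenizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")    # built-in rule-based sentence splitter
nlp.add_pipe("token_counter")  # the custom component defined above

doc = nlp("Apple is buying a startup. The deal is not final.")

print(nlp.pipe_names)                      # components run in this order
print(doc.user_data["n_tokens"])           # value stored by the custom component
print([sent.text for sent in doc.sents])   # sentences set by the sentencizer
```

Components run in the order they were added, each one receiving the Doc returned by the previous one.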
Entity Linking
Entity linking is the practice of identifying entities and linking them to other data sets you might have, for example to create visual graphics.
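As a rough illustration of the idea, the sketch below links entity strings to records in a second data set using a plain dictionary lookup. This is not spaCy's entity linker component. The `entities` pairs are copied from the NER output earlier in the notebook, while `knowledge_base`, its IDs, and the `link_entities` helper are all hypothetical.

```python
# (text, label) pairs taken from the NER output for doc2.
entities = [
    ("Ken Dilanian", "PERSON"),
    ("Joe Biden", "PERSON"),
    ("Capitol", "ORG"),
]

# Hypothetical external data set keyed by entity name; the IDs are made up.
knowledge_base = {
    "Joe Biden": {"id": "kb-001", "type": "person"},
    "Ken Dilanian": {"id": "kb-002", "type": "person"},
}


def link_entities(entities, kb):
    """Attach the matching record to each entity, or None if there is no match."""
    linked = []
    for text, label in entities:
        linked.append({"text": text, "label": label, "record": kb.get(text)})
    return linked


linked = link_entities(entities, knowledge_base)
for item in linked:
    print(item["text"], item["label"], item["record"])
```

Exact string matching is the simplest possible strategy. Real data usually needs alias handling and disambiguation, since the same name can refer to different things.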
NLP and more learning