Want to apply some machine learning magic to text but not quite ready to rip out your math/statistics textbooks? Or don't think you have enough data to really do anything interesting?
I remember first encountering a lot of these ideas in the NLTK book. It's a great resource, but for anyone just getting into natural language processing, it can be a bit much. There are many ways to train models to work with text, many techniques even outside of machine learning you can use to understand it, lots of corpus names and even more ways of organizing everything. NLTK is a potpourri of tools for working with text, which is great if you're really looking to experiment, but I like a bit more of a roadmap when entering into completely new territory.
One of the things I love most about spaCy is that it's opinionated. It gives you some of the most important techniques and tools for dealing with text and organizes them really well. It does let you go beyond the decisions it makes for you, but if your main interest is just building some more intelligent systems, you may never feel the need to rethink how they set things up.
Let's get what we need installed first. I'll assume that you've got a Python 3 environment where you want to install spaCy and its dependent libraries. If not, Anaconda is a great way to get Python for doing lots of data science stuff, including this, and it's always a good idea to create a separate environment for experimenting, whether it's a conda env or a regular ol' Python venv. As long as that's in order, you'll need just a few commands:
pip install spacy
python -m spacy download en_core_web_sm
# Only if you want to do the 2nd "Word math!" section
python -m spacy download en_core_web_md
Alright then! Let's go! Here are three ways you can do some stuff that would probably require quite a few more if statements without spaCy, all without needing to understand machine learning enough to do your own training.
RegEx++
If you're trying to validate or extract any sort of information programmatically from raw text, regular expressions can save you from a lot of tedium. Want to check that a user entered a phone number like 555-555-5555?
import re

def check_phone_number(number):
    # Exactly three digits, three digits, then four digits, separated by hyphens
    if re.match(r'^\d{3}-\d{3}-\d{4}$', number):
        print("Valid phone number")
    else:
        print("Not valid")

check_phone_number("555-555-5555")   # Valid phone number
check_phone_number("123-456-7890")   # Valid phone number
check_phone_number("123-4256-7890")  # Not valid
Or say you've got some text with a little structure to it and want to extract some of the more structured information:
import re

text = """
Name: Bob
Address: 444 Somewhere Ave
Hi, my name is Bob, and I'd like you to extract my name and address from this block of text!
"""

name_match = re.search(r'.*?Name:\s*(.*?)\n', text)
address_match = re.search(r'.*?Address:\s*(.*?)\n', text)

name = None
address = None
if name_match:
    name = name_match.group(1)
if address_match:
    address = address_match.group(1)

print(f"{name} lives at {address}")  # Bob lives at 444 Somewhere Ave
While regular expressions are great when there's at least a bit of predictable structure to the text you're trying to extract information from, they're less helpful when we only have linguistic structure. To deal with that, you need to be able to somehow cope with the ambiguity of words based on how they're used (e.g. the same word may be a noun or a verb depending on its role in a sentence) and with variation among words that share the same root meaning (e.g. adding "-ing" to a verb to show continuous action doesn't change what the action is). Using just basic regular expressions to understand language at this level would result in some incredibly complex code.
This is where natural language processing techniques like part-of-speech tagging and lemmatization come in. While part-of-speech tagging is generally done by a trained predictive model these days, spaCy makes it easy to use a pre-trained model. In other words, while under the hood you're using a machine learning model to tag each word as a noun, verb, adjective, adverb, etc., you just have to call a function, like you would to do any non-ML-based transformation.
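For example, here's a minimal look at what the tagger and lemmatizer give you for each token with the small English model (the sentence is just made up for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dogs were running around the old park.")

# Each token gets a coarse part-of-speech tag and a lemma (its base form)
for token in doc:
    print(f"{token.text:<10} {token.pos_:<6} {token.lemma_}")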
Once you've tagged the word tokens with parts of speech, you can do simple things like look for verbs and nouns, but you can also apply some linguistic knowledge and look for patterns, in the same way you would look for character-based patterns via a RegEx. Consider a fairly classic problem of finding not just nouns but noun phrases. One basic pattern that will find a lot of noun phrases (though it will also miss some) is: an (optional) determiner, (zero or more) adjectives, and (one) noun.
With spaCy, you can look for something like this with a Matcher:
import spacy
from spacy.util import filter_spans
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

def noun_chunks(text):
    doc = nlp(text)
    # Optional determiner, any number of adjectives, then a noun
    pattern = [
        {'POS': 'DET', 'OP': '?'},
        {'POS': 'ADJ', 'OP': '*'},
        {'POS': 'NOUN'}
    ]
    matcher = Matcher(nlp.vocab)
    matcher.add('NOUN_PHRASE', [pattern])  # recent spaCy versions take a list of patterns here
    matches = matcher(doc)
    spans = [doc[start:end] for match_id, start, end in matches]
    return filter_spans(spans)
One thing I want to point out in the above is the filter_spans function. You'll be able to find everything else pretty easily in spaCy's excellent documentation, but this one's a bit tucked away. Without it, if you have a phrase like 'the yellow dog', you'll get 'the yellow dog', 'yellow dog', and 'dog' as matches. That's a bit different from the sort of default behaviour I'm used to when working with regular expressions. In any event, what we want here, and what you'll probably often want, is to get only the largest matching spans. It's not too difficult to write your own function for this, but it's always nice when you can find one written for you, especially from the same library.
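If you're curious what a hand-rolled version might look like, here's a rough sketch that keeps only the longest non-overlapping spans (filter_spans also handles sorting and tie-breaking, so treat this as illustrative rather than a drop-in replacement):

def keep_longest_spans(spans):
    # Prefer longer spans; skip any span that overlaps one we've already kept
    kept = []
    seen_tokens = set()
    for span in sorted(spans, key=len, reverse=True):
        if not any(i in seen_tokens for i in range(span.start, span.end)):
            kept.append(span)
            seen_tokens.update(range(span.start, span.end))
    return sorted(kept, key=lambda span: span.start)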
Anyway, let's use this on some real text! For that, I'm going to grab the first paragraph from chapter 3 of Great Expectations:
It was a rimy morning, and very damp. I had seen the damp lying on the outside of my little window, as if some goblin had been crying there all night, and using the window for a pocket-handkerchief. Now, I saw the damp lying on the bare hedges and spare grass, like a coarser sort of spiders' webs; hanging itself from twig to twig and blade to blade. On every rail and gate, wet lay clammy, and the marsh mist was so thick, that the wooden finger on the post directing people to our village--a direction which they never accepted, for they never came there--was invisible to me until I was quite close under it. Then, as I looked up at it, while it dripped, it seemed to my oppressed conscience like a phantom devoting me to the Hulks.
Assuming you have the above function defined:
text = """
It was a rimy morning, and very damp. I had seen the damp lying on the
outside of my little window, as if some goblin had been crying there all
night, and using the window for a pocket-handkerchief. Now, I saw the
damp lying on the bare hedges and spare grass, like a coarser sort of
spiders' webs; hanging itself from twig to twig and blade to blade. On
every rail and gate, wet lay clammy, and the marsh mist was so thick,
that the wooden finger on the post directing people to our village--a
direction which they never accepted, for they never came there--was
invisible to me until I was quite close under it. Then, as I looked up
at it, while it dripped, it seemed to my oppressed conscience like a
phantom devoting me to the Hulks.
""".replace("\n", " ")
for chunk in noun_chunks(text):
    print(chunk)
You should get something pretty close to this:
a rimy
morning
the damp
the outside
my little window
some goblin
all night
the window
a pocket
handkerchief
the damp
the bare hedges
spare grass
a coarser sort
spiders
webs
twig
twig
every rail
gate
the marsh
mist
the wooden finger
the post
people
our village
a direction
conscience
a phantom
And that's actually pretty decent for a single, simple rule. You can get slightly better results by changing the final piece of the pattern from {'POS': 'NOUN'} to {'POS': 'NOUN', 'OP': '+'}.
a rimy morning
the damp
the outside
my little window
some goblin
all night
the window
a pocket
handkerchief
the damp
the bare hedges
spare grass
a coarser sort
spiders
webs
twig
twig
every rail
gate
the marsh mist
the wooden finger
the post
people
our village
a direction
conscience
a phantom
This groups phrases like 'a rimy morning' and 'the marsh mist' better than the first approach.
Now, if all we're interested in are noun phrases, spaCy already has a much easier way of getting those:
doc = nlp(text)
for noun_chunk in doc.noun_chunks:
    print(noun_chunk)
It
a rimy morning
I
the damp
the outside
my little window
some goblin
the window
a pocket-handkerchief
I
the damp
the bare hedges
spare grass
a coarser sort
spiders' webs
itself
twig
twig
every rail
gate
the marsh mist
the wooden finger
the post
people
our village
a direction
they
they
me
I
it
I
it
it
it
my oppressed conscience
a phantom
me
the Hulks
spaCy's obviously doing a bit more, allowing it to catch stuff like 'a pocket-handkerchief', 'my oppressed conscience', and 'spiders' webs', along with various pronouns. If we wanted to, we could add extra match patterns to catch these cases.
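For example, a couple of extra patterns along these lines might pick up hyphenated compounds and bare pronouns. This is just a sketch: whether the hyphen pattern fires depends on how the tokenizer splits the word and what tags the model assigns, so it's worth checking token.pos_ on your own text before relying on it.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Noun-hyphen-noun compounds like 'pocket-handkerchief'
hyphenated = [
    {'POS': 'DET', 'OP': '?'},
    {'POS': 'NOUN'},
    {'ORTH': '-'},
    {'POS': 'NOUN'}
]
# Standalone pronouns like 'I', 'it', or 'they'
pronoun = [{'POS': 'PRON'}]

matcher.add('HYPHENATED_NOUN', [hyphenated])
matcher.add('PRONOUN', [pronoun])

doc = nlp("I saw some goblin using the window for a pocket-handkerchief.")
for match_id, start, end in matcher(doc):
    print(doc[start:end])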
It's important to note that, because spaCy's POS-tagging is using a statistical model, it can still come up with incorrect tags for words, especially if you're operating with text that's in a very different domain from what spaCy's models were trained on. So you may still end up doing some actual data collection and machine learning. spaCy thankfully makes the latter pretty easy as well. But you can get pretty far with some simple rules and the default models. And this is barely scratching the surface of what you can put in your patterns.
If you wanted to find parts of text where people are talking about running, for example, you could make good use of the 'LEMMA' option:
pattern = [
    {'LEMMA': 'run'}
]
This will catch any of 'run', 'runs', 'ran', or 'running'. You can also use named entities, a host of more targeted checks like 'LIKE_NUM' or 'LIKE_URL', and even your own custom properties that you've added.
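For instance, a rough pattern for phrases like 'ran 5 miles' might combine a lemma with a number check. LIKE_NUM is a real token attribute, but the example is just made up to show the shape of such a pattern:

# Matches things like 'ran 5 miles' or 'running three kilometres'
pattern = [
    {'LEMMA': 'run'},
    {'LIKE_NUM': True},
    {'POS': 'NOUN'}
]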
Word math!
Word embeddings can let you do some pretty wild stuff. They're the product of machine learning, but thankfully you don't have to know much about how they're made to use them. Okay, I'm lying. Actually, you should know just a bit about how they're made before using them so that you're less likely to misuse them.
Basically, by looking at a lot of text and seeing what words show up together, you can effectively come up with numerical definitions of the words, which end up being big vectors of numbers. Now, are we going to be losing some important information about particular words here? Yep. But it tends to do a pretty good job of mapping words to a multi-dimensional space where similar words will be close to each other and very different words will be further apart. And the classic example of the seemingly magical things you can do with these is: if you take the vector for 'king', subtract 'man', and add 'woman', you'll get a vector back that's very close to 'queen'.
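If you want to try that yourself, here's a rough sketch using the medium model's vectors and spaCy's Vectors.most_similar. The exact neighbours you get back will depend on the model, but 'queen' usually shows up near the top (along with 'king' itself):

import spacy

nlp = spacy.load("en_core_web_md")

# Grab the raw vectors for each word
king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector

# Do the word math, then look up the nearest vectors in the vocabulary
result = (king - man + woman).reshape(1, -1)
keys, _, scores = nlp.vocab.vectors.most_similar(result, n=5)
print([nlp.vocab.strings[int(key)] for key in keys[0]])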
Hopefully this gets you thinking a bit critically about word embeddings. Yes, that example is nifty. It's almost as if we've got some actual human-like intelligence going on. But remember that these embeddings are derived from a whole bunch of existing text, and by the time you get to a 300-value vector, it's getting pretty hard to debug what the "understanding" of a particular word actually is. You could pretty easily create an incredibly biased algorithm using something like the above. That's easy to forget, because we're used to thinking of automated systems as the antithesis of human bias, or at least as free of the kind of human bias that can easily be audited. Although there's a lot of research going into dealing with the bias that can so easily be encoded into word vectors, it's best to be cautious.
That said, there are still many great uses for word vectors. spaCy makes one particular use really easy: detecting similarity. With spaCy, you can use single words or entire documents, though as you might imagine, the larger the block of text, the more skeptical you'll want to be of the answers. But let's take a look at something fairly simple:
import spacy
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like pizza.")
doc2 = nlp("I like fishing.")
doc3 = nlp("I like Italian food.")
doc4 = nlp("I like playing guitar.")
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc1.similarity(doc4))
Now, if we were ranking doc2, doc3, and doc4 based on how similar they are to doc1, how might they be ordered? We'd probably put doc3 as the closest. And between doc2 and doc4, we might think that doc2 would be closer because it at least involves something closely related to a food we might eat, whereas doc4 isn't anything close to food (hopefully). You might expect all of them to not be too far apart, given they're all about liking things.
And sure enough, we get:
0.8469626942011074
0.9200865018661752
0.8000981974639525
Again, you want to be careful about how you use this. Even forgetting the above digression about the dangers of word vectors, turning similarity into a single number obviously throws away a lot of nuance. But if you're using it to suggest potentially duplicate text or help search for something similar to a user's query, you're probably relatively safe.
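As a quick sketch of that search-style use, you could rank a handful of candidate texts against a user's query. The candidates and query here are invented for illustration:

import spacy

nlp = spacy.load("en_core_web_md")

candidates = [
    "How to make pizza dough at home",
    "Beginner guitar chords and strumming patterns",
    "The best lakes for fishing near Edmonton",
]
query = nlp("easy homemade pizza recipe")

# Sort candidates by similarity to the query, most similar first
ranked = sorted(candidates, key=lambda text: query.similarity(nlp(text)), reverse=True)
for text in ranked:
    print(text)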
Find people, places, and things
The last of these is probably the most complex under the hood, as it uses not only a statistical model for tagging parts of speech but another trained specifically to recognize named entities. It's also probably the easiest part of spaCy to get direct value from, so long as you're interested in the pre-trained categories and one of the models works well enough for you.
spaCy's documentation describes a named entity as a "real-world object". That's perhaps a bit simplistic, as it includes things like countries, events, monetary values, and time as named entities, but "real-world object" gets you thinking in the right direction. A fuzzier description that makes a bit more sense to me would be something that would jump out in the text if a human was reading it. But really, spaCy has some of the best documentation I've read for understanding how to really use and think about named entities, so I can't recommend it enough.
Let's see a quick example:
import spacy

nlp = spacy.load('en_core_web_sm')
text = "It's 2019, and David Ackerman is a software developer in Edmonton, Alberta."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.label_}: {ent.text}")
If you run this, you should get the following:
DATE: 2019
PERSON: David Ackerman
GPE: Edmonton
GPE: Alberta
One way to use this sort of thing would be to automate different ways of browsing a large corpus of text without having to label each chunk of text yourself. Or, as I mentioned above, you could create rules using these entities as even higher level features than basic parts of speech. What about looking for PERSON entities with an 'is a' relationship to a noun phrase to find out what people do in a set of news articles?
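Here's a hedged sketch of that last idea, reusing the Matcher attributes from earlier. The pattern is deliberately simplistic and will miss plenty of phrasings, but it shows how entity labels can slot into rules alongside parts of speech (you'd likely want filter_spans again to keep only the longest matches):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# One or more PERSON tokens, a form of 'be', then a simple noun phrase
pattern = [
    {'ENT_TYPE': 'PERSON', 'OP': '+'},
    {'LEMMA': 'be'},
    {'POS': 'DET', 'OP': '?'},
    {'POS': 'ADJ', 'OP': '*'},
    {'POS': 'NOUN', 'OP': '+'}
]
matcher.add('PERSON_IS_A', [pattern])

doc = nlp("David Ackerman is a software developer in Edmonton.")
for match_id, start, end in matcher(doc):
    print(doc[start:end])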
Conclusion
There's a lot more you can do with spaCy, especially if you're willing to do a bit of machine learning. In fact, I love how easy it makes training your own models. It makes a lot of decisions for you and the existing functionality is pretty powerful, but the creators have also put a lot of thought into letting you override many of those decisions when you need to.