spaCy is a popular Python library for natural language processing (NLP) that provides tools and components for text analysis, such as tokenization, part-of-speech tagging, named entity recognition, and dependency parsing. One of its useful features is the rule-based matching engine, which lets you find phrases and tokens in text based on custom patterns.
In this blog post, we will learn how to use the PhraseMatcher class in spaCy to efficiently match large terminology lists in text. We will also see how the PhraseMatcher works internally and how to use it with different token attributes and callbacks.
Let's begin :)
What is PhraseMatcher ⁉️
📌 PhraseMatcher is a tool that lets you find words and phrases in text based on patterns. A pattern is a way of describing what kind of words or phrases you want to find.
For example ⬇️
- A pattern could be “a cat” or “tree kangaroo”.
- You can make a list of patterns that you want to find in text, and PhraseMatcher will find them for you.
- PhraseMatcher is different from another tool called Matcher, which also lets you find words and phrases in text based on patterns.
- The difference is that Matcher uses rules to describe the patterns, while PhraseMatcher uses examples.
- For example, with Matcher, you can use rules like “find a word that is a noun” or “find a word that starts with ‘t’”.
- With PhraseMatcher, you use examples like “cat” or “tree”.
PhraseMatcher is good for finding words and phrases that are simple and fixed, like names of things or places. ⭕
Matcher is good for finding words and phrases that are more complex and flexible, like descriptions or expressions.⛔
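To make the difference concrete, here is a small sketch that runs both tools on the same sentence. It assumes spaCy v3 and uses a blank English pipeline (tokenizer only), so no model download is needed. The sentence and the names STARTS_WITH_T and ANIMAL are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# A blank English pipeline only tokenizes, which is enough for this comparison
nlp = spacy.blank("en")
doc = nlp("The tree kangaroo climbed a tall tree")

# Matcher: a rule that describes tokens ("any word that starts with t")
rule_matcher = Matcher(nlp.vocab)
rule_matcher.add("STARTS_WITH_T", [[{"TEXT": {"REGEX": "^[Tt]"}}]])
rule_hits = [doc[start:end].text for _, start, end in rule_matcher(doc)]

# PhraseMatcher: an example phrase to look for
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("ANIMAL", [nlp.make_doc("tree kangaroo")])
phrase_hits = [doc[start:end].text for _, start, end in phrase_matcher(doc)]

print(rule_hits)    # every token that starts with "t" or "T"
print(phrase_hits)  # ['tree kangaroo']
```

The rule finds every word that fits the description, while the example phrase finds only that exact phrase.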
How to use PhraseMatcher ⁉️
To use PhraseMatcher, you need to do these steps:⬇️
1. Import the PhraseMatcher tool from spaCy.
2. Make a PhraseMatcher object with a vocab object.
- The vocab object stores information about words, such as their text and other attributes, and is shared across texts.
- You need to use the same vocab object for the text that you want to find words and phrases in.
3. Choose what kind of information you want to use to find words and phrases in text. By default, PhraseMatcher uses the exact text of each word (the ORTH attribute).
- You can also use other kinds of information, like the lowercase form of the words (the LOWER attribute), their base form (the LEMMA attribute), what kind of words they are (the POS attribute), or what kind of things they are (the ENT_TYPE attribute).
4. Add one or more patterns to the PhraseMatcher using the add method.
- The add method takes a name (a string or a number) and a list of examples (Doc objects) as arguments.
- You can also give a function that will do something when PhraseMatcher finds a match (a callback function).
5. Use the PhraseMatcher on a text (a Doc or a Span object) to find all the matches in it.
- The PhraseMatcher gives you a list of matches (tuples) that have the name (stored as a hash), the start index, and the end index of each match. The indices count words (tokens), not characters.
Example : 🫡
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

# Make a list of examples (make_doc only tokenizes, which is all ORTH matching needs)
animals = ["cat", "dog", "tree kangaroo", "giant sea spider"]
examples = [nlp.make_doc(animal) for animal in animals]
matcher.add("AnimalList", examples)

# Make a text
text = "A tree kangaroo and a giant sea spider are two of Australia's most unusual animals"
text = nlp(text)

# Find matches in text
matches = matcher(text)
# The name is stored as a hash, so look it up in nlp.vocab.strings to read it
print([(nlp.vocab.strings[name], start, end) for name, start, end in matches])
# [('AnimalList', 1, 3), ('AnimalList', 5, 8)]
- The output shows that there are two matches found by PhraseMatcher: “tree kangaroo” and “giant sea spider”.
- Each match has a name (“AnimalList”), and a start and end index of the part of the text where it was found. The indices count words (tokens), and the end index is not included.
- PhraseMatcher compares the exact text by default, so the plural “tree kangaroos” would not match the pattern “tree kangaroo”.
We can also get the parts of the text where the matches were found by using the indices:😁
for name, start, end in matches:
    part = text[start:end]
    print(part.text)
# tree kangaroo
# giant sea spider
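If you would rather get the matched parts directly instead of index tuples, spaCy v3 lets you pass as_spans=True when calling the matcher. The sketch below assumes spaCy v3 and a blank English pipeline. The sentence is shortened for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; ORTH matching needs nothing more
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(t) for t in ["tree kangaroo", "giant sea spider"]]
matcher.add("AnimalList", patterns)

doc = nlp("A tree kangaroo and a giant sea spider are unusual animals")

# as_spans=True returns Span objects whose label is the pattern name
spans = matcher(doc, as_spans=True)
found = [(span.text, span.label_, span.start, span.end) for span in spans]
print(found)
# [('tree kangaroo', 'AnimalList', 1, 3), ('giant sea spider', 'AnimalList', 5, 8)]
```

This saves the manual lookup of the name and the slicing step.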
How does PhraseMatcher work ⁉️
- In recent versions of spaCy, PhraseMatcher works by storing the patterns in a trie (a prefix tree), an approach related to the Aho-Corasick algorithm.
- This lets it check a text against a very large list of words and phrases quickly.
- The trie acts like a machine built from the patterns, and PhraseMatcher looks at the text using this machine.
- For our example, the top of the machine has four arrows: “cat”, “dog”, “tree” (leading on to “kangaroo”), and “giant” (leading on to “sea” and then “spider”).
- The walk starts from the top of the machine and follows the arrows based on the words (tokens) in the text, not on single letters.
- Whenever it gets to a place that has an output, it means that a match has been found.
- For example, when it gets to the place with output “cat”, it means that the pattern “cat” has been found in the text.
- Similarly, when it gets to the place with output “tree kangaroo”, it means that the pattern “tree kangaroo” has been found in the text.
- Because matching works on whole tokens, the pattern “cat” is not found inside the word “caterpillar”. Matches can still overlap: if both “sea spider” and “giant sea spider” are patterns, PhraseMatcher gives you both matches for the text “giant sea spider”. If you only want the longest one, you can filter the results afterwards, for example with spacy.util.filter_spans.
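To give a feel for the idea, here is a tiny pure-Python sketch of a trie that works on whole tokens. It is not spaCy's actual implementation and not full Aho-Corasick (there are no failure links; the walk simply restarts at every token). Splitting on whitespace stands in for real tokenization, and the helper names build_trie and find_matches are made up for this example.

```python
def build_trie(patterns):
    """Build a nested-dict trie keyed on tokens; the None key marks a pattern end."""
    root = {}
    for pattern in patterns:
        node = root
        for token in pattern.split():
            node = node.setdefault(token, {})
        node[None] = pattern
    return root


def find_matches(trie, tokens):
    """Return (pattern, start, end) for every pattern occurrence in tokens."""
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break  # no arrow for this token, so stop this walk
            if None in node:
                matches.append((node[None], start, end + 1))
    return matches


trie = build_trie(["cat", "dog", "tree kangaroo", "giant sea spider", "sea spider"])
tokens = "a giant sea spider chased the tree kangaroo".split()
result = find_matches(trie, tokens)
print(result)
# [('giant sea spider', 1, 4), ('sea spider', 2, 4), ('tree kangaroo', 6, 8)]
```

Note that the overlapping patterns “giant sea spider” and “sea spider” are both reported, and that indices count tokens rather than characters.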
How to use different kinds of information with PhraseMatcher ⁉️
- By default, PhraseMatcher uses the exact text of the words (the ORTH attribute) to find words and phrases in text.
- However, you can also use other kinds of information, like the lowercase form of the words (the LOWER attribute), their base form (the LEMMA attribute), what kind of words they are (the POS attribute), or what kind of things they are (the ENT_TYPE attribute).
- To do this, you need to tell PhraseMatcher what kind of information you want to use when you make the PhraseMatcher object.
- For example, if you want to use the lowercase form of the words, you can use attr="LOWER":
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
- This will make PhraseMatcher not care about uppercase or lowercase letters, meaning that it will find “Cat” and “cat” as the same pattern.
Similarly, if you want to use what kind of words they are, you can use attr="POS":⬇️
matcher = PhraseMatcher(nlp.vocab, attr="POS")
- This will make PhraseMatcher use the part-of-speech tag of each word, no matter what the word is.
- For example, if you add a pattern like [nlp("a cat")] to the matcher, it will find any determiner followed by a noun in the text, such as “a dog”, “the tree”, “an elephant”, etc.
Note that when you use kinds of information other than ORTH, you need to make sure that both the patterns and the text have been processed by spaCy so that they have those kinds of information.
For example, if you want to use LEMMA or POS, you need a spaCy pipeline that has a lemmatizer or a tagger, and the patterns should be made with nlp(...) rather than nlp.make_doc(...), which only tokenizes.
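Here is a short sketch of case-insensitive matching with attr="LOWER", which compares the lowercase form of each token. It assumes spaCy v3 and a blank English pipeline, since the tokenizer already sets LOWER. The sentence is made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # LOWER is set by the tokenizer, so no model is needed

# attr="LOWER" compares lowercase forms, so case differences are ignored
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("AnimalList", [nlp.make_doc("tree kangaroo")])

doc = nlp("Tree Kangaroo, TREE KANGAROO and tree kangaroo all match")
hits = [doc[start:end].text for _, start, end in matcher(doc)]
print(hits)
# ['Tree Kangaroo', 'TREE KANGAROO', 'tree kangaroo']
```

With the default ORTH attribute, only the last of the three would be found.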
How to use functions with PhraseMatcher ⁉️
- PhraseMatcher also lets you give a function that will do something when PhraseMatcher finds a match (a callback function).
- The function takes four things:
- The PhraseMatcher object.
- The text object (the Doc).
- The position of the current match in the list of matches.
- The list of matches.
- You can use the function to do different things with the matches, such as giving them a custom name or adding them to the named things in doc.ents.
- For example, if you want to give all matches the custom name “ANIMAL” and add them to doc.ents, you can use a function like this:
from spacy.tokens import Span

def animal_function(matcher, text, i, matches):
    # Get the current match and make a part with the custom name
    name, start, end = matches[i]
    part = Span(text, start, end, label="ANIMAL")
    # Keep the named things that do not overlap with the new part
    others = [ent for ent in text.ents if ent.end <= start or ent.start >= end]
    text.ents = others + [part]

# Make a new matcher and add the function to it
matcher = PhraseMatcher(nlp.vocab)
matcher.add("AnimalList", examples, on_match=animal_function)
Now, when we use PhraseMatcher on a text object, it will not only find the matches but also name them:
text = nlp("A tree kangaroo and a giant sea spider are two of Australia's most unusual animals")
matches = matcher(text)
print([(start, end) for name, start, end in matches])
# [(1, 3), (5, 8)]
# Print the named parts
for ent in text.ents:
    if ent.label_ == "ANIMAL":
        print(ent.text, ent.label_)
# tree kangaroo ANIMAL
# giant sea spider ANIMAL
If you also want to join each match into one word, it is safer to do that after matching is finished (with doc.retokenize()), because joining words inside the callback changes the positions that the remaining matches point to. The older Span.merge method is no longer available in spaCy v3.
Take away :✈️
- PhraseMatcher can match on different kinds of information, such as the exact text of the words, their lowercase form, their base form, or what kind of things they are. For example, you can use PhraseMatcher to find all the words that are nouns, or all the words that are names of countries, or all the words that have the same lemma as “run”. This makes PhraseMatcher very flexible and powerful for finding patterns in text. 🚀
- PhraseMatcher stores the patterns in a trie, an efficient approach related to the Aho-Corasick algorithm, to find all the words and phrases in text very fast. It works by making a machine from the patterns and then looking at the text using this machine. The machine can find multiple matches in one pass over the text and also reports matches that overlap. This keeps PhraseMatcher fast even with very large lists of patterns. 🏎️
- PhraseMatcher also lets you give a function that will do something when it finds a match. For example, you can use a function to give the matches a custom name or add them to the named entities in doc.ents. This makes PhraseMatcher very useful for modifying and analyzing the text based on the matches. 🎨
Conclusion 🎉
In this blog post, we have learned how to use PhraseMatcher in spaCy to search for words and phrases in text. We have also learned how PhraseMatcher works internally and how to use it in different ways. PhraseMatcher is a powerful tool that can help us find what we are looking for in text quickly and easily.
I hope you have enjoyed this blog post and learned something new. Happy searching! 😊