Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, the Natural Language Toolkit, is a Python package that you can use for NLP.
A lot of the data that you could be analyzing is unstructured data and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it. In this tutorial, you'll take your first look at the kinds of text preprocessing tasks you can do with NLTK so that you'll be ready to apply them in future projects. You'll also see how to do some basic text analysis and create visualizations.
You have found the perfect site if you are comfortable with the fundamentals of utilising Python and would want to dip your toes into the world of natural language processing (NLP). You’ll be able to do the following by the time this lesson is over:
- Find text to analyze
- Preprocess your text for analysis
- Analyze your text
- Create visualizations based on your analysis

Let's get Pythoning!
Getting Started With Python’s NLTK
The first thing that you need to do is check to see whether you already have Python installed on your computer. Python 3.9 is the version you should use for this lesson. Check out the Python 3 Installation & Setup Instructions if you haven’t already installed Python on your computer. It will help you get started.
Once that's taken care of, the next step is to install NLTK with pip. It's a best practice to install it in a virtual environment. Check out Python Virtual Environments: A Primer if you'd like to learn more about virtual environments.
For this tutorial, you'll be installing version 3.5:
$ python -m pip install nltk==3.5
In order to create visualizations for named entity recognition later on, you'll also need to install NumPy and Matplotlib:
$ python -m pip install numpy matplotlib
Check out the article "What Is Pip?", an introduction for Python beginners, if you're interested in learning more about how pip works. You can also check out the official page on installing NLTK data.
Tokenizing
Tokenization provides a straightforward method for splitting up text either word-by-word or sentence-by-sentence. Because of this, you will be able to deal with more manageable chunks of text that maintain a reasonable degree of coherence and meaningfulness even when removed from the larger context of the rest of the text. It is the initial stage in the process of transforming unstructured data into structured data, which can then be analysed more easily.
When analyzing text, you'll tokenize by word and tokenize by sentence. Here's what both types of tokenization bring to the table:
- Tokenizing by word: Words are like the atoms of natural language. They're the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often. For example, if you were analyzing a group of job ads, then you might find that the word "Python" comes up often. That could suggest high demand for Python knowledge, but you'd need to look deeper to know more.
- Tokenizing by sentence: When you tokenize by sentence, you can analyze how those words relate to one another and see more context. Are there a lot of negative words around the word "Python" because the hiring manager doesn't like Python? Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?
Here's how to import the relevant parts of NLTK so that you can tokenize both by word and by sentence:
>>>
>>> from nltk.tokenize import sent_tokenize, word_tokenize
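Note: Depending on which NLTK data is already installed on your machine, you may also need to download the Punkt tokenizer models that sent_tokenize() and word_tokenize() rely on. It's a one-time download, and the call below is all it should take:

>>> import nltk
>>> nltk.download("punkt")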
Now that you've imported what you need, you can create a string to tokenize. Here's a quote from Dune that you can use:
>>>
>>> example_string = """
... Muad'Dib learned rapidly because his first training was in how to learn.
... And the first lesson of all was the basic trust that he could learn.
... It's shocking to find how many people do not believe they can learn,
... and how many more believe learning to be difficult."""
You can use sent_tokenize() to split up example_string into sentences:
>>>
>>> sent_tokenize(example_string)
["Muad'Dib learned rapidly because his first training was in how to learn.",
'And the first lesson of all was the basic trust that he could learn.',
"It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."]
Tokenizing example_string by sentence gives you a list of three strings that are sentences:
- "Muad'Dib learned rapidly because his first training was in how to learn."
- 'And the first lesson of all was the basic trust that he could learn.'
- "It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."
Now try tokenizing example_string by word:
>>>
>>> word_tokenize(example_string)
["Muad'Dib",
'learned',
'rapidly',
'because',
'his',
'first',
'training',
'was',
'in',
'how',
'to',
'learn',
'.',
'And',
'the',
'first',
'lesson',
'of',
'all',
'was',
'the',
'basic',
'trust',
'that',
'he',
'could',
'learn',
'.',
'It',
"'s",
'shocking',
'to',
'find',
'how',
'many',
'people',
'do',
'not',
'believe',
'they',
'can',
'learn',
',',
'and',
'how',
'many',
'more',
'believe',
'learning',
'to',
'be',
'difficult',
'.']
You got a list of strings that NLTK considers to be words, such as:
- "Muad'Dib"
- 'training'
- 'how'
But the following strings were also considered to be words:
- "'s"
- ','
- '.'
Notice how "It's" was split at the apostrophe to give you 'It' and "'s", while "Muad'Dib" was left whole? This happened because NLTK knows that 'It' and "'s" (a contraction of "is") are two distinct words, so it counted them separately. But "Muad'Dib" isn't an accepted contraction like "It's", so it wasn't read as two separate words and was left intact.
Filtering Stop Words
Stop words are words that you want to ignore, so you filter them out of your text when you're processing it. Very common words like "in", "is", and "an" are often used as stop words since they don't add a lot of meaning to a text in and of themselves.
Here's how to import the relevant parts of NLTK in order to filter out stop words:
>>>
>>> import nltk
>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize
Here's a quote from Worf that you can filter:
>>>
>>> worf_quote = "Sir, I protest. I am not a merry man!"
Now tokenize worf_quote by word and store the resulting list in words_in_quote:
>>>
>>> words_in_quote = word_tokenize(worf_quote)
>>> words_in_quote
['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']
You have a list of the words in worf_quote, so the next step is to create a set of stop words to filter words_in_quote. For this example, you'll need to focus on stop words in "english":
>>>
>>> stop_words = set(stopwords.words("english"))
Next, create an empty list to hold the words that make it past the filter:
>>>
>>> filtered_list = []
You created an empty list, filtered_list, to hold all the words in words_in_quote that aren't stop words. Now you can use stop_words to filter words_in_quote:
>>>
>>> for word in words_in_quote:
... if word.casefold() not in stop_words:
... filtered_list.append(word)
You iterated over words_in_quote with a for loop and added all the words that weren't stop words to filtered_list. You used .casefold() on word so you could ignore whether the letters in word were uppercase or lowercase. This is worth doing because stopwords.words('english') includes only lowercase versions of stop words.
Alternatively, you could use a list comprehension to make a list of all the words in your text that aren't stop words:
>>>
>>> filtered_list = [
... word for word in words_in_quote if word.casefold() not in stop_words
... ]
When you use a list comprehension, you don't create an empty list and then add items to the end of it. Instead, you define the list and its contents at the same time. Using a list comprehension is often seen as more Pythonic.
Take a look at the words that ended up in filtered_list:
>>>
>>> filtered_list
['Sir', ',', 'protest', '.', 'merry', 'man', '!']
You filtered out a few words like 'am' and 'a', but you also filtered out 'not', which does affect the overall meaning of the sentence. (Worf won't be happy about this.)
Words like 'I' and 'not' may seem too important to filter out, and depending on what kind of analysis you want to do, they can be. Here's why:

- 'I' is a pronoun, which is a context word rather than a content word:
  - Content words give you information about the topics covered in the text or the sentiment that the author has about those topics.
  - Context words give you information about writing style. You can observe patterns in how authors use context words in order to quantify their writing style. Once you've quantified their writing style, you can analyze a text written by an unknown author to see how closely it follows a particular writing style so you can try to identify who the author is.
- 'not' is technically an adverb but has still been included in NLTK's list of stop words for English. If you want to edit the list of stop words to exclude 'not' or make other changes, then you can download it.
So, 'I' and 'not' can be important parts of a sentence, but it depends on what you're trying to learn from that sentence.
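For example, if you decide that 'not' is too important to throw away, one minimal approach is to remove it from the stop word set before filtering. This is just a sketch that reuses the stop word list and words_in_quote from above, and custom_stop_words is a made-up name:

>>> custom_stop_words = set(stopwords.words("english")) - {"not"}
>>> [word for word in words_in_quote if word.casefold() not in custom_stop_words]
['Sir', ',', 'protest', '.', 'not', 'merry', 'man', '!']

Now 'not' survives the filter, so negations keep their effect on the sentence's meaning.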
Stemming
Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words "helping" and "helper" share the root "help". Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it's being used. NLTK has more than one stemmer, but you'll be using the Porter stemmer.
Here's how to import the relevant parts of NLTK in order to start stemming:
>>>
>>> from nltk.stem import PorterStemmer
>>> from nltk.tokenize import word_tokenize
Now that you're done importing, you can create a stemmer with PorterStemmer():
>>>
>>> stemmer = PorterStemmer()
The next step is to create a string to stem. Here's one you can use:
>>>
>>> string_for_stemming = """
... The crew of the USS Discovery discovered many discoveries.
... Discovering is what explorers do."""
Before you can stem the words in that string, you need to separate all the words in it:
>>>
>>> words = word_tokenize(string_for_stemming)
Now that you have a list of all the tokenized words from the string, take a look at what's in words:
>>>
>>> words
['The',
'crew',
'of',
'the',
'USS',
'Discovery',
'discovered',
'many',
'discoveries',
'.',
'Discovering',
'is',
'what',
'explorers',
'do',
'.']
Create a list of the stemmed versions of the words in words by using stemmer.stem() in a list comprehension:
>>>
>>> stemmed_words = [stemmer.stem(word) for word in words]
Take a look at what's in stemmed_words:
>>>
>>> stemmed_words
['the',
'crew',
'of',
'the',
'uss',
'discoveri',
'discov',
'mani',
'discoveri',
'.',
'discov',
'is',
'what',
'explor',
'do',
'.']
Here's what happened to all the words that started with "discov" or "Discov":
Original word | Stemmed version |
---|---|
'Discovery' | 'discoveri' |
'discovered' | 'discov' |
'discoveries' | 'discoveri' |
'Discovering' | 'discov' |
Those results look a little inconsistent. Why would 'Discovery' give you 'discoveri' when 'Discovering' gives you 'discov'?

Understemming and overstemming are two ways stemming can go wrong:

- Understemming happens when two related words should be reduced to the same stem but aren't. This is a false negative.
- Overstemming happens when two unrelated words are reduced to the same stem even though they shouldn't be. This is a false positive.
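To make those two failure modes concrete, here's a small sketch using the same Porter stemmer from above. The example words are chosen for illustration, and the exact output can vary slightly between stemmer versions:

>>> stemmer.stem("universe"), stemmer.stem("university")  # overstemming: unrelated words collapse
('univers', 'univers')
>>> stemmer.stem("datum"), stemmer.stem("data")  # understemming: related words stay apart
('datum', 'data')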
The Porter stemming algorithm dates from 1979, so it's a little on the older side. The Snowball stemmer, which is also called Porter2, is an improvement on the original and is also available through NLTK, so you can use that one in your own projects. It's also worth noting that the purpose of the Porter stemmer is not to produce complete words but to find variant forms of a word.
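If you'd like to try the Snowball stemmer, it's essentially a drop-in replacement for the Porter stemmer. Here's a quick sketch, reusing the words list from above; for this example the results are close to the Porter stemmer's, but Snowball also supports several languages besides English:

>>> from nltk.stem.snowball import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> [snowball_stemmer.stem(word) for word in words]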
Fortunately, you have some other ways to reduce words to their core meaning, such as lemmatizing, which you'll see later in this tutorial. But first, you need to cover parts of speech.
Tagging Parts of Speech
Part of speech is a grammatical term that refers to the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.
In English, there are eight parts of speech:
Part of speech | Role | Examples |
---|---|---|
Noun | Is a person, place, or thing | mountain, bagel, Poland |
Pronoun | Replaces a noun | you, she, we |
Adjective | Gives information about what a noun is like | efficient, windy, colorful |
Verb | Is an action or a state of being | learn, is, go |
Adverb | Gives information about a verb, an adjective, or another adverb | efficiently, always, very |
Preposition | Gives information about how a noun or pronoun is connected to another word | from, about, at |
Conjunction | Connects two other words or phrases | so, because, and |
Interjection | Is an exclamation | yay, ow, wow |
While some sources include the category articles (such as “a” or “the”) in the list of parts of speech, other sources consider these words to be adjectives instead. The term “article” is synonymous with the word “determiner” in NLTK.
Here's how to import the relevant parts of NLTK in order to tag parts of speech:
>>>
>>> from nltk.tokenize import word_tokenize
Now create some text to tag. You can use this Carl Sagan quote:
>>>
>>> sagan_quote = """
... If you wish to make an apple pie from scratch,
... you must first invent the universe."""
Use word_tokenize to separate the words in that string and store them in a list:
>>>
>>> words_in_sagan_quote = word_tokenize(sagan_quote)
Now call nltk.pos_tag() on your new list of words:
>>>
>>> import nltk
>>> nltk.pos_tag(words_in_sagan_quote)
[('If', 'IN'),
('you', 'PRP'),
('wish', 'VBP'),
('to', 'TO'),
('make', 'VB'),
('an', 'DT'),
('apple', 'NN'),
('pie', 'NN'),
('from', 'IN'),
('scratch', 'NN'),
(',', ','),
('you', 'PRP'),
('must', 'MD'),
('first', 'VB'),
('invent', 'VB'),
('the', 'DT'),
('universe', 'NN'),
('.', '.')]
All the words in the quote are now in a separate tuple, with a tag that represents their part of speech. But what do the tags mean? Here's how to get a list of tags and their meanings:
>>>
>>> nltk.help.upenn_tagset()
The list is quite long, but you can run the command above to browse through it in full.
Here's a summary that you can use to get started with NLTK's POS tags:
Tags that start with | Deal with |
---|---|
JJ | Adjectives |
NN | Nouns |
RB | Adverbs |
PRP | Pronouns |
VB | Verbs |
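If you only want the definitions for one family of tags instead of the whole list, nltk.help.upenn_tagset() also accepts a tag pattern. For example, this should print just the noun tags:

>>> nltk.help.upenn_tagset("NN.*")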
Now that you know what the POS tags mean, you can see that your tagging was fairly successful:
- 'pie' was tagged NN because it's a singular noun.
- 'you' was tagged PRP because it's a personal pronoun.
- 'invent' was tagged VB because it's the base form of a verb.
But how would NLTK handle tagging the parts of speech in a text that is basically gibberish? Jabberwocky is a nonsense poem that doesn't technically mean much but is still written in a way that can convey some kind of meaning to English speakers.
Make a string to hold an excerpt from this poem:
>>>
>>> jabberwocky_excerpt = """
... 'Twas brillig, and the slithy toves did gyre and gimble in the wabe:
... all mimsy were the borogoves, and the mome raths outgrabe."""
Use word_tokenize to separate the words in the excerpt and store them in a list:
>>>
>>> words_in_excerpt = word_tokenize(jabberwocky_excerpt)
Call nltk.pos_tag() on your new list of words:
>>>
>>> nltk.pos_tag(words_in_excerpt)
[("'T", 'NN'),
('was', 'VBD'),
('brillig', 'VBN'),
(',', ','),
('and', 'CC'),
('the', 'DT'),
('slithy', 'JJ'),
('toves', 'NNS'),
('did', 'VBD'),
('gyre', 'NN'),
('and', 'CC'),
('gimble', 'JJ'),
('in', 'IN'),
('the', 'DT'),
('wabe', 'NN'),
(':', ':'),
('all', 'DT'),
('mimsy', 'NNS'),
('were', 'VBD'),
('the', 'DT'),
('borogoves', 'NNS'),
(',', ','),
('and', 'CC'),
('the', 'DT'),
('mome', 'JJ'),
('raths', 'NNS'),
('outgrabe', 'RB'),
('.', '.')]
Accepted English words like 'and' and 'the' were correctly tagged as a conjunction and a determiner, respectively. The gibberish word 'slithy' was tagged as an adjective, which is what a human English speaker would probably assume from the context of the poem as well. Way to go, NLTK!
Lemmatizing
Now that you're familiar with parts of speech, you can circle back to lemmatizing. Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'. Note: A lemma is a word that represents a whole group of words, and that group of words is called a lexeme.
If you were to look up the term “blending” in a dictionary, for instance, you would first need to look at the entry for the word “blend,” but you would find “blending” mentioned in that entry as well.
In this particular instance, “blend” serves as the lemma, while “blending” is included as a component of the lexeme. When you reduce a word to its lemma, this process is referred to as lemmatization.
Here's how to import the relevant parts of NLTK in order to start lemmatizing:
>>>
>>> from nltk.stem import WordNetLemmatizer
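Note: The WordNet lemmatizer relies on the WordNet corpus, so depending on which NLTK data you've already downloaded, you may first need to fetch it:

>>> import nltk
>>> nltk.download("wordnet")
>>> nltk.download("omw-1.4")  # only needed on some NLTK versions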
Create a lemmatizer to use:
>>>
>>> lemmatizer = WordNetLemmatizer()
Let's start by lemmatizing a plural noun:
>>>
>>> lemmatizer.lemmatize("scarves")
'scarf'
'scarves' gave you 'scarf', which is already a bit more sophisticated than what you would have gotten with the Porter stemmer, which is 'scarv'. Next, create a string with more than one word to lemmatize:
>>>
>>> string_for_lemmatizing = "The friends of DeSoto love scarves."
Now tokenize that string by word:
>>>
>>> words = word_tokenize(string_for_lemmatizing)
Here's your list of words:
>>>
>>> words
['The',
'friends',
'of',
'DeSoto',
'love',
'scarves',
'.']
Create a list containing all the words in words after they've been lemmatized:
>>>
>>> lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
Here's the list you got:
>>>
>>> lemmatized_words
['The',
'friend',
'of',
'DeSoto',
'love',
'scarf',
'.']
That looks right. The plurals 'friends' and 'scarves' became the singulars 'friend' and 'scarf'.
But what would happen if you lemmatized a word that looked very different from its lemma? Try lemmatizing "worst":
>>>
>>> lemmatizer.lemmatize("worst")
'worst'
You got the result 'worst' because lemmatizer.lemmatize() assumed that "worst" was a noun. You can make it clear that you want "worst" to be treated as an adjective:
>>>
>>> lemmatizer.lemmatize("worst", pos="a")
'bad'
You passed the parameter pos="a" to make sure that "worst" was treated as an adjective, since the default value of the pos parameter is "n" for noun. You got 'bad', which looks very different from your original word and is nothing like what you'd get from stemming. This is because "worst" is the superlative form of the adjective "bad", and lemmatizing reduces superlatives as well as comparatives to their lemmas.
Now that you know how to use NLTK to tag parts of speech, you can try tagging your words before lemmatizing them to avoid mixing up homographs, which are words that are spelled the same but have different meanings and can be different parts of speech.
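Here's a minimal sketch of that idea. The helper name lemmatize_sentence and the tag-to-POS mapping aren't part of NLTK; mapping the first letter of each Penn Treebank tag to a WordNet POS code is just a common convention, with anything unmapped falling back to being treated as a noun:

>>> from nltk import pos_tag
>>> def lemmatize_sentence(sentence):
...     # Map Penn Treebank tag families (JJ*, VB*, NN*, RB*) to WordNet POS codes
...     tag_map = {"J": "a", "V": "v", "N": "n", "R": "r"}
...     return [
...         lemmatizer.lemmatize(word, pos=tag_map.get(tag[0], "n"))
...         for word, tag in pos_tag(word_tokenize(sentence))
...     ]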
Chunking
While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases. Note: A phrase is a word or group of words that works as a single unit to perform a grammatical function. Noun phrases are built around a noun.
Below are some examples:
- “A planet”
- “A tilting planet”
- “A swiftly tilting planet”
POS tags are used in the chunking process in order to group words and then chunk tags are applied to those groups. Since chunks do not overlap with one another, it is only possible for a given word to appear in a single chunk at a time.
Here's how to import the relevant parts of NLTK in order to chunk:
>>>
>>> from nltk.tokenize import word_tokenize
Before you can chunk, you need to make sure that the parts of speech in your text are tagged, so create a string for POS tagging. You can use this quote from The Lord of the Rings:
>>>
>>> lotr_quote = "It's a dangerous business, Frodo, going out your door."
Now tokenize that string by word:
>>>
>>> words_in_lotr_quote = word_tokenize(lotr_quote)
>>> words_in_lotr_quote
['It',
"'s",
'a',
'dangerous',
'business',
',',
'Frodo',
',',
'going',
'out',
'your',
'door',
'.']
Now you have a list of all the words in lotr_quote. The next step is to tag those words by part of speech:
>>>
>>> nltk.download("averaged_perceptron_tagger")
>>> lotr_pos_tags = nltk.pos_tag(words_in_lotr_quote)
>>> lotr_pos_tags
[('It', 'PRP'),
("'s", 'VBZ'),
('a', 'DT'),
('dangerous', 'JJ'),
('business', 'NN'),
(',', ','),
('Frodo', 'NNP'),
(',', ','),
('going', 'VBG'),
('out', 'RP'),
('your', 'PRP$'),
('door', 'NN'),
('.', '.')]
You've got a list of tuples of all the words in the quote, along with their POS tags. In order to chunk, you first need to define a chunk grammar. Note: A chunk grammar is a combination of rules on how sentences should be chunked. It often uses regular expressions, or regexes. You don't need to know how regular expressions work in order to follow this tutorial, but they will definitely come in handy if you ever want to process text.
Create a chunk grammar with one regular expression rule:
>>>
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
The abbreviation NP stands for “noun phrase.” In the book Natural Language Processing with Python—Analyzing Text with the Natural Language Toolkit, Chapter 7 is dedicated to discussing noun phrase chunking in further detail.
According to the rule you created, your chunks:

- Start with an optional (?) determiner ('DT')
- Can have any number (*) of adjectives (JJ)
- End with a noun (<NN>)
Create a chunk parser with this grammar:
>>>
>>> chunk_parser = nltk.RegexpParser(grammar)
Now try it out with your quote:
>>>
>>> tree = chunk_parser.parse(lotr_pos_tags)
Here's how you can see a visual representation of this tree:
>>>
>>> tree.draw()
The following is how the representation appears when it’s drawn out:
You got two noun phrases:

- 'a dangerous business' has a determiner, an adjective, and a noun.
- 'door' has just a noun.
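If you'd rather inspect the chunks in your terminal than draw the tree, one way (a sketch that assumes the tree you just parsed) is to walk its NP subtrees and join their words back together:

>>> for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
...     print(" ".join(word for word, tag in subtree.leaves()))
...
a dangerous business
door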
Now that you know the basics of chunking, it's time to take a look at chinking.
Chinking
Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is used to exclude a pattern.
Let's reuse the quote you used in the section on chunking. You already have a list of tuples containing each of the words in the quote along with its POS tag:
>>>
>>> lotr_pos_tags
[('It', 'PRP'),
("'s", 'VBZ'),
('a', 'DT'),
('dangerous', 'JJ'),
('business', 'NN'),
(',', ','),
('Frodo', 'NNP'),
(',', ','),
('going', 'VBG'),
('out', 'RP'),
('your', 'PRP$'),
('door', 'NN'),
('.', '.')]
The next step is to create a grammar to determine what you want to include and exclude in your chunks. This time, you're going to use more than one line because you're going to have more than one rule. Because you're using more than one line for the grammar, you'll be using triple quotes ("""):
>>>
>>> grammar = """
... Chunk: {<.*>+}
... }<JJ>{"""
The first rule of your grammar is {<.*>+}. This rule has inward-facing curly braces ({}) because it's used to determine what patterns you want to include in your chunks. In this case, you want to include everything: <.*>+.
The second rule of your grammar is }<JJ>{. This rule has outward-facing curly braces (}{) because it's used to determine what patterns you want to exclude from your chunks. In this case, you want to exclude adjectives: <JJ>.
Create a chunk parser with this grammar:
>>>
>>> chunk_parser = nltk.RegexpParser(grammar)
Now chunk your sentence with the chink you specified:
>>>
>>> tree = chunk_parser.parse(lotr_pos_tags)
You get this tree as a result:
>>>
>>> tree
Tree('S', [Tree('Chunk', [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT')]), ('dangerous', 'JJ'), Tree('Chunk', [('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.')])])
In this case, ('dangerous', 'JJ') was excluded from the chunks because it's an adjective. But you'll have an easier time understanding the result if you get a visual representation again:
>>>
>>> tree.draw()
You will get an image of the tree that looks like this:
Here, 'dangerous' has been excluded from your chunks, leaving you with two chunks containing everything else. The first chunk has all the text that appeared before the adjective that was excluded. The second chunk contains everything that came after it.
Now that you know how to exclude patterns from your chunks, it's time to look into named entity recognition (NER).
Using Named Entity Recognition (NER)
Named entities are phrases that function as nouns and refer to particular places, persons, and organisations, among other types of things. Named entity identification allows you to not only locate the named entities that are present in your texts but also identify the kind of named entity that each one represents.
Here's the list of named entity types from the NLTK book:
NE type | Examples |
---|---|
ORGANIZATION | Georgia-Pacific Corp., WHO |
PERSON | Eddy Bonte, President Obama |
LOCATION | Murray River, Mount Everest |
DATE | June, 2008-06-29 |
TIME | two fifty a m, 1:30 p.m. |
MONEY | 175 million Canadian dollars, GBP 10.40 |
PERCENT | twenty pct, 18.75 % |
FACILITY | Washington Monument, Stonehenge |
GPE | South East Asia, Midlothian |
You can use nltk.ne_chunk() to recognize named entities. Let's use lotr_pos_tags again to test it out:
>>>
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")
>>> tree = nltk.ne_chunk(lotr_pos_tags)
Now take a look at the visual representation:
>>>
>>> tree.draw()
What you receive is as follows:
See how Frodo has been tagged as a PERSON? You also have the option to use the parameter binary=True if you just want to know what the named entities are but not what kind of named entity they are:
>>>
>>> tree = nltk.ne_chunk(lotr_pos_tags, binary=True)
>>> tree.draw()
Now all you can see is that Frodo is an NE.
That's how you can identify named entities! But you can take this one step further and extract named entities directly from your text. Create a string from which to extract named entities. You can use this quote from The War of the Worlds:
>>>
>>> quote = """
... Men like Schiaparelli watched the red planet—it is odd, by-the-bye, that
... for countless centuries Mars has been the star of war—but failed to
... interpret the fluctuating appearances of the markings they mapped so well.
... All that time the Martians must have been getting ready.
...
... During the opposition of 1894 a great light was seen on the illuminated
... part of the disk, first at the Lick Observatory, then by Perrotin of Nice,
... and then by other observers. English readers heard of it first in the
... issue of Nature dated August 2."""
Now create a function to extract named entities:
>>>
>>> def extract_ne(quote):
...     words = word_tokenize(quote)
... tags = nltk.pos_tag(words)
... tree = nltk.ne_chunk(tags, binary=True)
... return set(
... " ".join(i[0] for i in t)
... for t in tree
... if hasattr(t, "label") and t.label() == "NE"
... )
With this function, you gather all named entities, with no repeats. In order to do that, you tokenize by word, apply part-of-speech tags to those words, and then extract named entities based on those tags. Because you included binary=True, the named entities you get won't be labeled more specifically. You'll just know that they're named entities.
Take a look at the information you extracted:
>>>
>>> extract_ne(quote)
{'Lick Observatory', 'Mars', 'Nature', 'Perrotin', 'Schiaparelli'}
You missed the city of Nice, possibly because NLTK interpreted it as a regular English adjective, but you still got the following:
- An institution: 'Lick Observatory'
- A planet: 'Mars'
- A publication: 'Nature'
- People: 'Perrotin', 'Schiaparelli'
That's a pretty respectable variety!
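If you also want to know what kind each named entity is, a variation on extract_ne() can drop binary=True and keep the labels from the table above. This is only a sketch, and extract_labeled_ne is a made-up name:

>>> def extract_labeled_ne(quote):
...     # Like extract_ne(), but keeps each entity's label (PERSON, GPE, and so on)
...     words = word_tokenize(quote)
...     tags = nltk.pos_tag(words)
...     tree = nltk.ne_chunk(tags)
...     return {
...         (t.label(), " ".join(word for word, pos in t))
...         for t in tree
...         if hasattr(t, "label")
...     }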
Getting Text to Analyze
Now that you've done some text processing tasks with small example texts, you're ready to analyze a bunch of texts at once. A group of texts is called a corpus. NLTK, the Natural Language Toolkit, provides several corpora covering everything from novels hosted by Project Gutenberg to inaugural addresses by presidents of the United States.
In order to analyze texts in NLTK, you first need to import them. This requires nltk.download("book"), which is a pretty big download:
>>>
>>> nltk.download("book")
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
This gives you access to a few linear texts (such as Sense and Sensibility and Monty Python and the Holy Grail) as well as a few groups of texts (such as a chat corpus and a personals corpus). The human condition is endlessly fascinating, so let's take a closer look at the personals corpus and see what we can learn from it.
This corpus is a collection of personals ads, which were an early version of online dating. If you wanted to meet someone, then you could place an ad in a newspaper and wait for other readers to respond to you.
If you're interested in learning how to get other texts to analyze, then you can check out Chapter 3 of Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit.
Using a Concordance
When you use a concordance, you can see each time a word is used, along with its immediate context. This can give you a peek into how a word is being used at the sentence level and what words are used with it.
Let's see what these good people looking for love have to say! The personals corpus is called text8, so call .concordance() on it with the parameter "man":
>>>
>>> text8.concordance("man")
Displaying 14 of 14 matches:
to hearing from you all . ABLE young man seeks , sexy older women . Phone for
ble relationship . GENUINE ATTRACTIVE MAN 40 y . o ., no ties , secure , 5 ft .
ship , and quality times . VIETNAMESE MAN Single , never married , financially
ip . WELL DRESSED emotionally healthy man 37 like to meet full figured woman fo
nth subs LIKE TO BE MISTRESS of YOUR MAN like to be treated well . Bold DTE no
eeks lady in similar position MARRIED MAN 50 , attrac . fit , seeks lady 40 - 5
eks nice girl 25 - 30 serious rship . Man 46 attractive fit , assertive , and k
40 - 50 sought by Aussie mid 40s b / man f / ship r / ship LOVE to meet widowe
discreet times . Sth E Subs . MARRIED MAN 42yo 6ft , fit , seeks Lady for discr
woman , seeks professional , employed man , with interests in theatre , dining
tall and of large build seeks a good man . I am a nonsmoker , social drinker ,
lead to relationship . SEEKING HONEST MAN I am 41 y . o ., 5 ft . 4 , med . bui
quiet times . Seeks 35 - 45 , honest man with good SOH & similar interests , f
genuine , caring , honest and normal man for fship , poss rship . S / S , S /
Interestingly, the last three of the fourteen matches have to do with seeking an honest man, specifically:

- SEEKING HONEST MAN
- Seeks 35 - 45 , honest man with good SOH & similar interests
- genuine , caring , honest and normal man for fship , poss rship
Let's see whether there's a similar pattern with the word "woman":
>>>
>>> text8.concordance("woman")
Displaying 11 of 11 matches:
at home . Seeking an honest , caring woman , slim or med . build , who enjoys t
thy man 37 like to meet full figured woman for relationship . 48 slim , shy , S
rry . MALE 58 years old . Is there a Woman who would like to spend 1 weekend a
other interests . Seeking Christian Woman for fship , view to rship . SWM 45 D
ALE 60 - burly beared seeks intimate woman for outings n / s s / d F / ston / P
ington . SCORPIO 47 seeks passionate woman for discreet intimate encounters SEX
le dad . 42 , East sub . 5 " 9 seeks woman 30 + for f / ship relationship TALL
personal trainer looking for married woman age open for fun MARRIED Dark guy 37
rinker , seeking slim - medium build woman who is happy in life , age open . AC
. O . TERTIARY Educated professional woman , seeks professional , employed man
real romantic , age 50 - 65 y . o . WOMAN OF SUBSTANCE 56 , 59 kg ., 50 , fit
That theme of honesty came up in the first match only:
Seeking an honest , caring woman , slim or med . build
While looking through a corpus with a concordance won't give you the full picture, it can still be interesting to take a peek and see if anything stands out.
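If the full display is more than you want, .concordance() also accepts width and lines arguments, so you can, for example, limit the output to the first five matches:

>>> text8.concordance("man", lines=5)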
Making a Dispersion Plot
You can use a dispersion plot to see how much a particular word appears and where it appears. So far, you've looked for "man" and "woman", but it would be interesting to see how much those words are used compared to their synonyms:
>>>
>>> text8.dispersion_plot(
... ["woman", "lady", "girl", "gal", "man", "gentleman", "boy", "guy"]
... )
The following is the dispersion plot that you receive:
Each vertical blue line represents one instance of a word. Each horizontal row of blue lines represents the corpus as a whole. This plot shows that:

- "lady" was used a lot more than "woman" or "girl". There were no instances of "gal".
- "man" and "guy" were used a similar number of times and were more common than "gentleman" or "boy".
When you wish to see where certain words appear across a text or corpus, you may do it with the use of a dispersion plot. If you are examining a single piece of text, this might assist you in seeing which words appear in close proximity to one another. When you do an analysis of a collection of texts that is ordered chronologically, it might be helpful to identify which terms were used more often or less frequently over the course of a certain period of time.
Staying on the theme of romance, see what you can discover by making a dispersion plot for Sense and Sensibility, which is text2. Jane Austen's novels talk a lot about people's homes, so make a dispersion plot with the names of a few of them:
>>>
>>> text2.dispersion_plot(["Allenham", "Whitwell", "Cleveland", "Combe"])
Here's the plot you get:
It seems like Allenham comes up a lot in the first third of the novel and then mostly drops off. Cleveland, on the other hand, barely comes up in the first two thirds but shows up a fair amount in the last third. This distribution reflects changes in the relationship between Marianne and Willoughby:
- Allenham is the home of Willoughby's benefactress and comes up a lot when Marianne is first interested in him.
- Cleveland is a home that Marianne stays at after she goes to see Willoughby in London and things go wrong.
Dispersion plots are just one type of visualization you can make for textual data. The next one you'll look at is frequency distributions.
Making a Frequency Distribution
With a frequency distribution, you can check which words show up most frequently in your text. You'll need to start with an import:
>>>
>>> from nltk import FreqDist
FreqDist is a subclass of collections.Counter. Here's how to create a frequency distribution of the entire corpus of personals ads:
>>>
>>> frequency_distribution = FreqDist(text8)
>>> print(frequency_distribution)
<FreqDist with 1108 samples and 4867 outcomes>
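Because FreqDist inherits from Counter, you can also look up the count for a single word by indexing into the distribution. For example, the count for "lady" should match what you'll see in the most common words below:

>>> frequency_distribution["lady"]
68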
Since 1108 samples and 4867 outcomes is a lot of information, start by narrowing that down. Here's how to see the 20 most common words in the corpus:
>>>
>>> frequency_distribution.most_common(20)
[(',', 539),
('.', 353),
('/', 110),
('for', 99),
('and', 74),
('to', 74),
('lady', 68),
('-', 66),
('seeks', 60),
('a', 52),
('with', 44),
('S', 36),
('ship', 33),
('&', 30),
('relationship', 29),
('fun', 28),
('in', 27),
('slim', 27),
('build', 27),
('o', 26)]
You have a lot of stop words in your frequency distribution, but you can remove them just as you did earlier. Create a list of all the words in text8 that aren't stop words:
>>>
>>> meaningful_words = [
... word for word in text8 if word.casefold() not in stop_words
... ]
Now that you have a list of all the words in your corpus that aren't stop words, make a frequency distribution of them:
>>>
>>> frequency_distribution = FreqDist(meaningful_words)
Take a look at the 20 most common words:
>>>
>>> frequency_distribution.most_common(20)
[(',', 539),
('.', 353),
('/', 110),
('lady', 68),
('-', 66),
('seeks', 60),
('ship', 33),
('&', 30),
('relationship', 29),
('fun', 28),
('slim', 27),
('build', 27),
('smoker', 23),
('50', 23),
('non', 22),
('movies', 22),
('good', 21),
('honest', 20),
('dining', 19),
('rship', 18)]
You can turn this list into a graph:
>>>
>>> frequency_distribution.plot(20, cumulative=True)
The graph that you receive is as follows:
Some of the most common words are:

- 'lady'
- 'seeks'
- 'ship'
- 'relationship'
- 'fun'
- 'slim'
- 'build'
- 'smoker'
- '50'
- 'non'
- 'movies'
- 'good'
- 'honest'
From what you already know about the people writing these personals ads, honesty seems important to them, and they use the word 'lady' a lot. In addition, 'slim' and 'build' both show up the same number of times. You saw slim and build used near each other when you were learning about concordances, so maybe those two words are commonly used together in this corpus. That brings us to collocations!
Finding Collocations
A collocation is a sequence of words that shows up often. If you're curious about common collocations in English, then you can check out The BBI Dictionary of English Word Combinations. It's a handy reference you can use to help make sure your writing is idiomatic. Here are some examples of collocations that use the word "tree":
- Syntax tree
- Family tree
- Decision tree
To see pairs of words that come up often in your corpus, call .collocations() on it:
>>>
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
Slim build came up, as did medium build and several other word combinations. No long walks on the beach, though!
But what would happen if, after lemmatizing the words in your corpus, you searched for collocations? Would you locate certain word combinations that you didn’t notice the first time since they appeared in slightly different versions?
If you followed the instructions earlier, then you'll already have a lemmatizer, but you can't call collocations() on just any data type, so you'll need to do a little preparation. Start by making a list of the lemmatized versions of all the words in text8:
>>>
>>> lemmatized_words = [lemmatizer.lemmatize(word) for word in text8]
But in order to be able to do the linguistic processing tasks you've seen so far, you'll need to make an NLTK text with this list:
>>>
>>> new_text = nltk.Text(lemmatized_words)
Here's how to see the collocations in your new text:
>>>
>>> new_text.collocations()
Compared with your earlier list of collocations, this new one is missing a few:

- weekends away
- poss rship
The idea of quiet nights is still there in the lemmatized version, quiet night. Your latest search for collocations also brought up a few new ones:

- year old suggests that users often mention ages.
- photo pls suggests that users often request one or more photos.
That's how you can find common word combinations to see what people are talking about and how they're talking about it!