
Natural Language Processing with Spacy in Python

Description of NLP and SpaCy

Natural language processing (NLP) is a field of artificial intelligence that aims to enable computers to understand human language. NLP involves analyzing, quantifying, understanding, and deriving meaning from natural languages.

Note: Transformer-based NLP models are currently the most effective. Google's BERT and OpenAI's GPT family are examples of such models.

As of version 3.0, spaCy supports transformer-based models. The examples in this tutorial were created with a smaller, CPU-optimized pipeline, but you can run them with a transformer model instead. spaCy can use transformer models available through Hugging Face.

NLP has several applications and aids in the extraction of insights from unstructured text, including:

  • Automatic summarization
  • Named-entity recognition
  • Question answering
  • Sentiment analysis

SpaCy is an open-source NLP library for Python, written in Cython. It is designed to make it easy to build systems for information extraction or general-purpose natural language processing.

SpaCy Installation

This section covers downloading data and models for the English language and installing spaCy in a virtual environment.

You can install spaCy with pip, the Python package manager. It's a good idea to use a virtual environment to keep the installation independent of system-wide packages. For more information on virtual environments and pip, see Python Virtual Environments: A Primer and Using Python's pip to Manage Your Projects' Dependencies.

The first step is to create a new virtual environment, activate it, and install spaCy in it. The exact commands depend on your operating system:
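The installation commands were not preserved in this copy; on macOS or Linux they might look like the following (the environment name venv is just an example):

```shell
# Create and activate a virtual environment, then install spaCy.
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
python -m pip install spacy
```

On Windows, the activation step is `venv\Scripts\activate` instead.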

Your virtual environment has spaCy installed, and you are almost ready to use NLP. However, there's still something else you need to set up:

There are several spaCy models for different languages. The default model for the English language is en_core_web_sm. The models are installed separately because they take up a lot of space; bundling all languages into one package would make it far too large.
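The download command was stripped from this copy; with spaCy installed, it is presumably:

```shell
# Download the small English pipeline used throughout this tutorial.
python -m spacy download en_core_web_sm
```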

Open a Python REPL after the en_core_web_sm model has finished downloading to verify that the installation was successful:

If these lines execute without errors, spaCy was installed, and the models and data were downloaded correctly. You're now prepared to use spaCy to explore NLP!

Document Object for Text Processing

In this section, you'll use spaCy to process a given input string and to read the same text from a file.

First, load the language model instance in spaCy:

The load() function returns a callable Language object, which is commonly assigned to a variable named nlp.

Once you create a Doc object, you can begin processing your data. A Doc object is a sequence of Token objects, each representing a lexical token. Each Token object holds information about a particular piece of text, typically a single word. You create a Doc object by calling the Language object with the input string as an argument:


['This', 'tutorial', 'is', 'about', 'Natural', 'Language',
'Processing', 'in', 'spaCy', '.']

In the example above, the text is used to create a Doc object. From there, you can access a wealth of information about the processed text.

For instance, you iterated over the Doc object with a list comprehension, which produces a list of Token objects. On each Token object, you accessed the .text attribute to retrieve the text that makes up that token.

However, you won't often be copying and pasting text directly into the constructor. Instead, you'll likely read it from a file:


['This', 'tutorial', 'is', 'about', 'Natural', 'Language',
'Processing', 'in', 'spaCy', '.', '\n']

In this example, you used the pathlib.Path object's .read_text() method to read the contents of the introduction.txt file. Because the file contains the same text as the previous example, you get the same result.

Detecting Sentences

Sentence detection is the process of locating where each sentence in a text begins and ends. This makes it possible to divide a text into linguistically meaningful units. You'll use these units when performing tasks such as part-of-speech (POS) tagging and named-entity recognition, which you'll learn about later in this tutorial.

In spaCy, sentences are extracted using the Doc object's .sents property. Here's how to extract the total number of sentences and the individual sentences for a given input:


Gus Proto is a Python...
He is interested in learning...

In the example above, spaCy correctly recognized the sentences in the input. The .sents property yields Span objects, each representing a single sentence. You can also slice the Span objects to produce sections of a sentence.

Sentence detection behavior can also be customized with unique delimiters. Here's an example of using an ellipsis (...) as a delimiter, in addition to the full stop or period (.):


Gus, can you...
never mind, I forgot what I was saying.
So, do you think we should ...

In this example, you used the @Language.component("set_custom_boundaries") decorator to define a new function that takes a Doc object as input. This function's job is to find the tokens in the Doc that mark the start of a sentence and set their .is_sent_start attribute to True. The function must then return the Doc object.

Then, using the .add_pipe() method, you add the custom boundary function to the Language object's pipeline. Text parsed with this modified Language object will now treat the word after an ellipsis as the start of a new sentence.

Tokens in spaCy

Tokenizing the text is a step in building the Doc container. Tokenization divides a text into its constituent parts, or tokens, represented as Token objects in spaCy.

You've already seen that you can print a Doc object's tokens by iterating over it with spaCy. Token objects, however, also have other attributes that can be explored. For instance, the .idx attribute on a Token holds the token's original index position in the string:


Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142

In this example, you iterate over the Doc, printing both the Token and its .idx attribute, which marks the starting character of the token in the original text. Keeping this information could be handy for in-place word replacement down the line.

Like many other aspects of spaCy, the tokenization process can be customized to detect tokens split on custom characters. This is frequently needed with hyphenated words such as "London-based."

To customize tokenization, you assign a new Tokenizer object to the callable Language object's tokenizer attribute.

To see what's going on, consider a text that joins words with the @ symbol rather than the standard hyphen (-), so you have London@based instead of London-based:


['for', 'a', 'London@based', 'Fintech', 'company', '.', 'He']

In this example, spaCy reads the London@based text as a single token. If a hyphen were used instead of the @ symbol, the default parsing would produce three tokens.

To treat the @ symbol as a custom infix, you must build your own Tokenizer object:


['for', 'a', 'London', '@', 'based', 'Fintech', 'company']

In this example, you start with a fresh Language object. When building a new Tokenizer, you typically give it:

  • Vocabulary: A storage container for special cases, used to handle things like contractions and emoticons.
  • Prefix search: A function that handles preceding punctuation, such as opening parentheses.
  • Suffix search: A function that handles succeeding punctuation, such as closing parentheses.
  • Infix finditer: A function that handles non-whitespace separators, such as hyphens.
  • Token match: An optional Boolean function that matches strings that should never be split. It overrides the previous rules and is useful for entities like URLs or numbers.

These functions are typically the .search() and .finditer() methods of compiled regex objects. If you don't want to change the prefixes and suffixes, you can reuse the default regex objects that spaCy already provides.

To create a custom infix function, you define a new list containing your desired regex patterns. You then join this list to the Language object's Defaults.infixes attribute, which must be cast to a list first; doing so retains all the infixes already in use. Finally, you pass the extended list as an argument to spacy.util.compile_infix_regex() to obtain your new regex object for infixes.

The .search() methods of the prefix and suffix regex objects, along with the .finditer() method of the infix regex object, are passed to the Tokenizer constructor. Now you can replace the tokenizer on the custom nlp object.

Once that's done, you'll notice that the @ symbol is tokenized individually.
