Introduction to spaCy

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.

Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage. spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library, Thinc. Using Thinc as its backend, spaCy features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models that perform these tasks are available for 17 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. spaCy additionally provides tokenization support for more than 65 languages, which lets users train custom models on their own datasets.
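
As a quick illustration of these components in action, here is a minimal sketch, assuming the small English package en_core_web_sm has already been downloaded:

    import spacy

    # Load a prebuilt English pipeline (install it first with:
    # python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

    # Part-of-speech tags and syntactic dependencies predicted in context
    for token in doc:
        print(token.text, token.pos_, token.dep_)

    # Named entities recognized by the NER component
    for ent in doc.ents:
        print(ent.text, ent.label_)

The library has grown into these capabilities over several major releases: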

  • Version 1.0 was released on October 19, 2016 and added preliminary support for deep learning workflows via custom processing pipelines. It also included a rule matcher that supported entity annotations, and an officially documented training API.
  • Version 2.0 was released on November 7, 2017 and introduced convolutional neural network models for 7 different languages. It also supported custom processing pipeline components and extension attributes, and featured a built-in trainable text classification component.
  • Version 3.0 was released on February 1, 2021 and introduced state-of-the-art transformer-based pipelines. It also introduced a new configuration system and training workflow, as well as type hints and project templates. This version dropped support for Python 2.
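
As a quick illustration of the 3.0 features, here is a minimal sketch, assuming the spacy-transformers extension and the en_core_web_trf package are installed:

    import spacy

    # A transformer-based pipeline, introduced with spaCy v3.0
    # (assumes spacy-transformers and en_core_web_trf are installed)
    nlp = spacy.load("en_core_web_trf")

    # The shared transformer component feeds the downstream
    # tagger, parser and NER components
    print(nlp.pipe_names)

    # Every v3 pipeline carries its configuration, restored on load
    print(nlp.config["nlp"]["pipeline"])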

In short, spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?
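
spaCy can help answer questions like these directly. For instance, the similarity question: here is a minimal sketch, assuming the medium English package en_core_web_md (which includes word vectors) is installed:

    import spacy

    # Vector-based similarity needs word vectors, which ship with the
    # medium and large English packages (assumes en_core_web_md is installed)
    nlp = spacy.load("en_core_web_md")

    doc1 = nlp("I like salty fries and hamburgers.")
    doc2 = nlp("Fast food tastes very good.")

    # Compares the averaged word vectors of the two documents
    print(doc1.similarity(doc2))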

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

What spaCy isn’t

  • spaCy is not a platform or “an API”. Unlike a platform, spaCy does not provide software as a service or a web application. It’s an open-source library designed to help you build NLP applications, not a consumable service.
  • spaCy is not an out-of-the-box chat bot engine. While spaCy can be used to power conversational applications, it’s not designed specifically for chat bots, and only provides the underlying text processing capabilities.
  • spaCy is not research software. It’s built on the latest research, but it’s designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.
  • spaCy is not a company. It’s an open-source library. The company that publishes spaCy and other software is called Explosion.

While some of spaCy’s features work independently, others require trained pipelines to be loaded, which enable spaCy to predict linguistic annotations – for example, whether a word is a verb or a noun. A trained pipeline can consist of multiple components that use a statistical model trained on labeled data. spaCy currently offers trained pipelines for a variety of languages, which can be installed as individual Python modules. Pipeline packages can differ in size, speed, memory usage, accuracy and the data they include. Which package you choose depends on your use case and the texts you’re working with; for general-purpose use, the small, default packages are a good start. They typically include the following components (a short sketch showing how to inspect them follows the list):

  • Binary weights for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
  • Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
  • Data files like lemmatization rules and lookup tables.
  • Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
  • Configuration options, like the language and processing pipeline settings and model implementations to use, to put spaCy in the correct state when you load the pipeline.
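
Once a pipeline package is loaded, these components can be inspected directly from Python. A minimal sketch, again assuming en_core_web_sm is installed:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Statistical components backed by the package's binary weights
    print(nlp.pipe_names)

    # Context-independent lexical attributes stored in the vocabulary
    lexeme = nlp.vocab["Apple"]
    print(lexeme.text, lexeme.shape_, lexeme.is_alpha, lexeme.is_stop)

    # The configuration that puts spaCy in the correct state on load
    print(nlp.config["nlp"]["lang"], nlp.config["nlp"]["pipeline"])

    # Note: the small packages ship without static word vectors; use the
    # md or lg packages when you need vector-based similarity
    print(nlp.vocab.vectors.shape)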

The above is a brief overview of spaCy. Watch this space for more updates on the latest trends in technology.
