Introduction to spaCy

spaCy is an open-source software library for advanced natural language processing (NLP), written in Python and Cython. Its syntactic parser is reported to be among the fastest available. The library is published under the MIT license and currently offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian and Dutch, plus a multi-language NER model, as well as tokenization for various other languages.

spaCy is a relatively new framework in the Python NLP ecosystem, but it is quickly gaining ground and may well become the de facto standard library.

spaCy focuses on providing software for production use. As of version 1.0, spaCy also supports deep learning workflows that let you connect statistical models trained with popular machine learning libraries such as TensorFlow, Keras, scikit-learn or PyTorch. spaCy’s machine learning library, Thinc, is also available as a separate open-source Python library.

spaCy comes with several extensions and visualizations that are available as free, open-source libraries:

  • Thinc: A machine learning library optimized for CPU usage and deep learning with text input.
  • sense2vec: A library for computing word similarities, based on Word2vec and sense2vec.
  • displaCy: An open-source dependency parse tree visualizer built with JavaScript, CSS and SVG.
  • displaCy ENT: An open-source named entity visualizer built with JavaScript and CSS.

Some of the features of spaCy are as follows:

  1. It’s really FAST:
    Written in Cython, it was specifically designed to be as fast as possible.
  2. It’s really ACCURATE:
    spaCy’s dependency parser is one of the best-performing available, according to the comparison “It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool”.
  3. Batteries included:
    Index-preserving tokenization (details about this later), models for part-of-speech tagging, named entity recognition and dependency parsing. Supports 8 languages out of the box. Easy and beautiful visualizations. Pretrained word vectors.
  4. Extensible:
    It plays nicely with the existing tools that you know and love: scikit-learn, TensorFlow, gensim.
  5. Deep-learning ready:
    It also has its own deep learning framework that’s especially designed for NLP tasks.
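
To make "index-preserving tokenization" from the list above a little more concrete, here is a minimal sketch. It assumes a reasonably recent spaCy release (v2 or later) is installed, and it uses a blank English pipeline so that no pretrained model has to be downloaded. The sample sentence and the variable names are only illustrative.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no pretrained model download is needed for this sketch.
nlp = spacy.blank("en")

text = "spaCy keeps character offsets, doesn't it?"
doc = nlp(text)

# token.idx is the character offset of the token in the original string.
offsets = [(token.text, token.idx) for token in doc]
for tok, start in offsets:
    # Slicing the original text at the recorded offset recovers the token.
    assert text[start:start + len(tok)] == tok
    print(f"{start:>3}  {tok}")

# text_with_ws keeps each token's trailing whitespace, so joining the
# pieces reproduces the input exactly (tokenization is non-destructive).
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Part-of-speech tags, named entities and dependency parses are not produced by a blank pipeline. For those you would load a pretrained pipeline with spacy.load, which has to be installed separately.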

What spaCy isn’t

  • spaCy is not a platform or “an API”. Unlike a platform, spaCy does not provide software as a service or a web application. It’s an open-source library designed to help you build NLP applications, not a consumable service.
  • spaCy is not an out-of-the-box chat bot engine. While spaCy can be used to power conversational applications, it’s not designed specifically for chat bots, and only provides the underlying text processing capabilities.
  • spaCy is not research software. It’s built on the latest research, but it’s designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.

The above is a brief overview of spaCy compiled from various sites. Watch this space for more information on the latest trends in technology.
