{"id":1356,"date":"2021-07-08T07:08:29","date_gmt":"2021-07-08T07:08:29","guid":{"rendered":"https:\/\/blog.amt.in\/?p=1356"},"modified":"2021-07-08T07:08:29","modified_gmt":"2021-07-08T07:08:29","slug":"introduction-to-spacy","status":"publish","type":"post","link":"https:\/\/blog.amt.in\/index.php\/2021\/07\/08\/introduction-to-spacy\/","title":{"rendered":"Introduction to spaCy"},"content":{"rendered":"<p>spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.<\/p>\n<p>Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production use. spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library Thinc. Using Thinc as its backend, spaCy features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models to perform these tasks are available for 17 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. 
Additional support for tokenization for more than 65 languages allows users to train custom models on their own datasets as well.<\/p>\n<ul>\n<li>Version 1.0 was released on October 19, 2016 and included preliminary support for deep learning workflows by supporting custom processing pipelines. It further included a rule matcher that supported entity annotations, and an officially documented training API.<\/li>\n<li>Version 2.0 was released on November 7, 2017 and introduced convolutional neural network models for 7 different languages. It also supported custom processing pipeline components and extension attributes, and featured a built-in trainable text classification component.<\/li>\n<li>Version 3.0 was released on February 1, 2021 and introduced state-of-the-art transformer-based pipelines. It also introduced a new configuration system and training workflow, as well as type hints and project templates. This version dropped support for Python 2.<\/li>\n<\/ul>\n<p>spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.<\/p>\n<p>If you\u2019re working with a lot of text, you\u2019ll eventually want to know more about it. For example, what\u2019s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?<\/p>\n<p>spaCy is designed specifically for production use and helps you build applications that process and \u201cunderstand\u201d large volumes of text. 
It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.<\/p>\n<h4 id=\"what-spacy-isnt\" class=\"d8fa3eb9 _693d0bc3\">What spaCy isn\u2019t<\/h4>\n<ul class=\"_0c0c9282\">\n<li class=\"_0aa126ad a4f93fbe\"><strong>spaCy is not a platform or \u201can API\u201d<\/strong>. Unlike a platform, spaCy does not provide software as a service or a web application. It\u2019s an open-source library designed to help you build NLP applications, not a consumable service.<\/li>\n<li class=\"_0aa126ad a4f93fbe\"><strong>spaCy is not an out-of-the-box chat bot engine<\/strong>. While spaCy can be used to power conversational applications, it\u2019s not designed specifically for chat bots, and only provides the underlying text processing capabilities.<\/li>\n<li class=\"_0aa126ad a4f93fbe\"><strong>spaCy is not research software<\/strong>. It\u2019s built on the latest research, but it\u2019s designed to get things done. This leads to fairly different design decisions than <span class=\"_31865462\">NLTK<\/span> or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.<\/li>\n<li class=\"_0aa126ad a4f93fbe\"><strong>spaCy is not a company<\/strong>. It\u2019s an open-source library. 
The company that publishes spaCy and other software is called Explosion.<\/li>\n<\/ul>\n<p>While some of spaCy\u2019s features work independently, others require trained pipelines to be loaded, which enable spaCy to <strong>predict<\/strong> linguistic annotations \u2013 for example, whether a word is a verb or a noun. A trained pipeline can consist of multiple components that use a statistical model trained on labeled data. spaCy currently offers trained pipelines for a variety of languages, which can be installed as individual Python modules. Pipeline packages can differ in size, speed, memory usage, accuracy and the data they include. The package you choose always depends on your use case and the texts you\u2019re working with. For a general-purpose use case, the small, default packages are always a good start. They typically include the following components:<\/p>\n<ul class=\"_0c0c9282\">\n<li class=\"_0aa126ad\"><strong>Binary weights<\/strong> for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.<\/li>\n<li class=\"_0aa126ad\"><strong>Lexical entries<\/strong> in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.<\/li>\n<li class=\"_0aa126ad\"><strong>Data files<\/strong> like lemmatization rules and lookup tables.<\/li>\n<li class=\"_0aa126ad\"><strong>Word vectors<\/strong>, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.<\/li>\n<li class=\"_0aa126ad\"><strong>Configuration<\/strong> options, like the language and processing pipeline settings and model implementations to use, to put spaCy in the correct state when you load the pipeline.<\/li>\n<\/ul>\n<p>The above is a brief introduction to spaCy. 
Watch this space for more updates on the latest trends in technology.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>spaCy is an open-source software library for<\/p>\n","protected":false},"author":1,"featured_media":1357,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[150,179,7],"tags":[152,180,18],"class_list":["post-1356","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-open-source-software-library","category-spacy","category-techtrends","tag-open-source-software-library","tag-spacy","tag-technology"],"_links":{"self":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1356","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/comments?post=1356"}],"version-history":[{"count":1,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1356\/revisions"}],"predecessor-version":[{"id":1358,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1356\/revisions\/1358"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/media\/1357"}],"wp:attachment":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/media?parent=1356"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/categories?post=1356"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/tags?post=1356"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}