{"id":1431,"date":"2021-10-26T07:09:45","date_gmt":"2021-10-26T07:09:45","guid":{"rendered":"https:\/\/blog.amt.in\/?p=1431"},"modified":"2021-10-26T07:09:45","modified_gmt":"2021-10-26T07:09:45","slug":"introduction-to-apache-spark","status":"publish","type":"post","link":"https:\/\/blog.amt.in\/index.php\/2021\/10\/26\/introduction-to-apache-spark\/","title":{"rendered":"Introduction to Apache Spark"},"content":{"rendered":"<p>Apache Spark is an open-source cluster-computing framework. Its architectural foundation is the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. In Spark 1.x, the RDD was the primary application programming interface (API); as of Spark 2.x, use of the Dataset API is encouraged, although the RDD API is not deprecated, and RDD technology still underlies the Dataset API.<\/p>\n<p>Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster-computing paradigm, which forces a particular linear data-flow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results back on disk. Spark&#8217;s RDDs function as a working set for distributed programs, offering a (deliberately) restricted form of distributed shared memory.<\/p>\n<p>Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive\/exploratory data analysis, i.e., the repeated database-style querying of data. 
The latency of such applications may be reduced by several orders of magnitude compared with a MapReduce implementation (as was common in Apache Hadoop stacks). Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.<\/p>\n<p>Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, and Kudu, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing, where distributed storage is not required and the local file system can be used instead; in that scenario, Spark runs on a single machine with one executor per CPU core.<\/p>\n<p>Since its release, Apache Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. Spark has quickly grown the largest open-source community in big data, with over 1,000 contributors from 250+ organizations.<\/p>\n<p>Spark Core is the foundation of the overall project. 
It provides distributed task dispatching, scheduling, and basic I\/O functionality, exposed through an application programming interface (for Java, Python, Scala, and R) centered on the RDD abstraction (the Java API is available to other JVM languages, and is also usable from some non-JVM languages that can connect to the JVM, such as Julia). This interface mirrors a functional\/higher-order model of programming: a &#8220;driver&#8221; program invokes parallel operations such as map, filter, or reduce on an RDD by passing a function to Spark, which then schedules the function&#8217;s execution in parallel on the cluster. These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault tolerance is achieved by tracking the &#8220;lineage&#8221; of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, Java, or Scala objects.<\/p>\n<p>The above is a brief overview of Apache Spark. 
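<\/p>\n<p>As a quick, hedged illustration of the functional\/higher-order model described above, the sketch below uses plain Python built-ins standing in for the equivalent PySpark chain rdd.map(...).filter(...).reduce(...); it is an analogy run on one machine, not Spark itself:<\/p>\n

```python
from functools import reduce

# A plain-Python analogy for the RDD pipeline described above.
# In PySpark, the same chain would look roughly like:
#   sc.parallelize(range(10)).map(lambda x: x * x)
#     .filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)
# but here each step runs locally instead of in parallel on a cluster.
data = range(10)

squares = map(lambda x: x * x, data)           # "map": square every element
evens = filter(lambda x: x % 2 == 0, squares)  # "filter": keep the even squares
total = reduce(lambda a, b: a + b, evens)      # "reduce": sum what survives

print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```

\n<p>Like Spark&#8217;s RDD operations, map and filter here are lazy iterators; nothing is actually computed until reduce consumes them, much as Spark defers work until an action forces evaluation.<\/p>\n<p>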
Watch this space for more updates on the latest trends in Technology.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Spark is an open-source cluster-computing framework. Apache Spark has<\/p>\n","protected":false},"author":1,"featured_media":1433,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[205,844,7],"tags":[206,845,18],"class_list":["post-1431","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","category-computing-framework","category-techtrends","tag-apache-spark","tag-computing-framework","tag-technology"],"_links":{"self":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1431","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/comments?post=1431"}],"version-history":[{"count":1,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1431\/revisions"}],"predecessor-version":[{"id":1432,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1431\/revisions\/1432"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/media\/1433"}],"wp:attachment":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/media?parent=1431"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/categories?post=1431"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/tags?post=1431"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templa
ted":true}]}}