Introduction to Apache Spark – Advanced Millennium Technologies

Apache Spark is an open-source cluster-computing framework. Its architectural foundation is the resilient distributed dataset (RDD), a read-only multiset of data items that is distributed over a cluster of machines and maintained in a fault-tolerant way. In Spark 1.x, the RDD was the primary application programming interface (API); as of Spark 2.x, use of the Dataset API is encouraged, although the RDD API is not deprecated. RDD technology still underlies the Dataset API.

Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster-computing paradigm, which forces a particular linear data-flow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark’s RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
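To make the linear data flow concrete, here is a minimal sketch of a single MapReduce-style pass in plain Python. It is only an illustration on one machine, not Hadoop or Spark code; the function names (map_phase, reduce_phase, run_job, demo) and the word-count task are assumptions chosen for the example.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def reduce_phase(pairs):
    """Reduce: group the pairs by word and sum the counts per word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return {word: sum(counts) for word, counts in grouped.items()}


def run_job(input_path, output_path):
    """One linear pass: read from disk, map, reduce, store on disk."""
    with open(input_path, encoding="utf-8") as src:
        counts = reduce_phase(map_phase(src))
    with open(output_path, "w", encoding="utf-8") as dst:
        json.dump(counts, dst)
    return counts


def demo():
    """Run the job once on a tiny temporary input file (hypothetical data)."""
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "input.txt")
        output_path = os.path.join(workdir, "output.json")
        with open(input_path, "w", encoding="utf-8") as f:
            f.write("spark hadoop spark\nhadoop spark\n")
        return run_job(input_path, output_path)
```

Any follow-up computation on the result would have to start again from the file written by run_job, which is the restriction that a shared working set is meant to relax.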

Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive or exploratory data analysis, i.e., repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to a MapReduce implementation (as was common in Apache Hadoop stacks). Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.
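The difference for iterative algorithms can be sketched with a toy example in plain Python. This is not Spark code: DiskSource is a hypothetical stand-in for a data set on disk that simply counts how often it is read in full, and the algorithm is a one-dimensional gradient descent toward the mean, chosen only because it is short.

```python
class DiskSource:
    """Hypothetical stand-in for a data set on disk; counts full reads."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read_all(self):
        self.reads += 1
        return list(self._values)


def gradient_step(points, estimate, rate=0.5):
    """One gradient-descent step that moves the estimate toward the mean."""
    gradient = sum(estimate - p for p in points) / len(points)
    return estimate - rate * gradient


def fit_rereading(source, iterations):
    """Disk-bound loop: every iteration reads the whole data set again."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(source.read_all(), estimate)
    return estimate


def fit_cached(source, iterations):
    """Working-set loop: read once, then iterate over the data in memory."""
    points = source.read_all()
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(points, estimate)
    return estimate
```

Both loops arrive at the same estimate, but fit_rereading touches the source once per iteration while fit_cached touches it once in total. On a real cluster the repeated reads are where much of the latency goes.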

Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone mode (a native Spark cluster), Hadoop YARN, and Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, and Kudu; a custom solution can also be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing, in which distributed storage is not required and the local file system can be used instead. In this scenario, Spark runs on a single machine, typically with one worker thread per CPU core.

Since its release, Apache Spark, the unified analytics engine, has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open source community in big data, with over 1,000 contributors from 250+ organizations.

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality, exposed through an application programming interface (for Java, Python, Scala, and R) centered on the RDD abstraction. The Java API is available to other JVM languages and is also usable from some non-JVM languages, such as Julia, that can connect to the JVM. This interface mirrors a functional, higher-order model of programming: a “driver” program invokes parallel operations such as map, filter, or reduce on an RDD by passing a function to Spark, which then schedules the function’s execution in parallel on the cluster. These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault tolerance is achieved by keeping track of the “lineage” of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, Java, or Scala objects.
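The ideas of immutability, lazy operations, and lineage can be illustrated with a small single-machine imitation in plain Python. ToyRDD below is a hypothetical class written for this post, not part of the Spark API, and it ignores partitioning and scheduling entirely. It only shows how recording operations instead of running them lets a result be rebuilt from its source at any time.

```python
import functools


class ToyRDD:
    """Hypothetical single-machine imitation of an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # immutable base data
        self._lineage = tuple(lineage)  # recorded (kind, function) steps

    def map(self, fn):
        # Transformation: return a new ToyRDD; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def lineage(self):
        """Names of the operations that produced this ToyRDD, in order."""
        return [kind for kind, _ in self._lineage]

    def collect(self):
        """Action: replay the recorded lineage over the source data."""
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def reduce(self, fn):
        """Action: combine the computed items pairwise with fn."""
        return functools.reduce(fn, self.collect())
```

For example, ToyRDD(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0) does no work until collect() or reduce() is called. Because the result is never stored, calling collect() a second time rebuilds it from the source and the lineage, which is the same mechanism the paragraph above describes for recovering from data loss.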

The above is a brief overview of Apache Spark. Watch this space for more updates on the latest trends in technology.
