Introduction to Presto

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB. One can even query data from multiple data sources within a single query. Presto is community-driven, open-source software released under the Apache License.

SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e. data incorporating relations among entities and variables.

SQL offers two main advantages over older read–write APIs such as ISAM or VSAM. Firstly, it introduces the concept of accessing many records with a single command. Secondly, it eliminates the need to specify how to reach a record, e.g. with or without an index.
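Both advantages can be seen in a minimal sketch. It uses Python's built-in sqlite3 module rather than Presto, purely because it runs without a cluster; the orders table, its columns and the index name are made up for illustration.

```python
import sqlite3

# In-memory database with a small, hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("west", 20.0), ("east", 30.0), ("west", 40.0)],
)

QUERY = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"

# Advantage 1: one statement reads every record; no record-by-record loop is written.
without_index = conn.execute(QUERY).fetchall()

# Advantage 2: adding an index may change how the engine reaches the rows,
# but the query text stays exactly the same.
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")
with_index = conn.execute(QUERY).fetchall()

# A single UPDATE likewise modifies many records at once.
updated = conn.execute(
    "UPDATE orders SET amount = amount * 2 WHERE region = 'east'"
).rowcount
```

The query returns the same rows before and after the index is created, which is the point: the access path is the engine's concern, not the caller's.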

Presto was originally designed and developed at Facebook so that its data analysts could run interactive queries on the company's large data warehouse in Apache Hadoop. Before Presto, the data analysts at Facebook relied on Apache Hive for running SQL analytics on their multi-petabyte data warehouse. Hive was inadequate for Facebook’s scale, and Presto was invented to fill the gap and run fast queries. Development started in 2012, and Presto was deployed at Facebook later that year. In November 2013, Facebook announced its release as open source. In 2014, Netflix disclosed that it used Presto on 10 petabytes of data stored in the Amazon Simple Storage Service (S3).

In January 2019, the Presto Software Foundation was announced. The foundation is a not-for-profit organization dedicated to the advancement of the Presto open-source distributed SQL query engine. Development of Presto continues independently, with PrestoDB maintained by Facebook and PrestoSQL maintained by the Presto Software Foundation, with some cross-pollination of code.

Presto’s architecture is very similar to that of a classic database management system using cluster computing, i.e. massively parallel processing (MPP). It can be visualized as one coordinator node working in sync with multiple worker nodes. Clients submit SQL statements that are parsed and planned, after which parallel tasks are scheduled to workers. Workers jointly process rows from the data sources and produce results that are returned to the client. Compared to the original Apache Hive execution model, which used the Hadoop MapReduce mechanism on each query, Presto does not write intermediate results to disk, resulting in a significant speed improvement. Presto is written in the Java programming language.
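The coordinator/worker split can be illustrated with a toy model. This is not Presto's implementation (Presto is written in Java and its scheduler is far more involved); it is a sketch in Python, with hypothetical function names, in which a coordinator divides the input into splits, workers compute partial aggregates in parallel, and the partial results are merged in memory without being written to disk.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(split):
    """Worker: scan one split and return a partial (count, sum) kept in memory."""
    values = [row["amount"] for row in split if row["region"] == "east"]
    return len(values), sum(values)


def coordinator(rows, num_workers=3):
    """Coordinator: create splits, schedule one task per split, merge the partials."""
    splits = [rows[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_task, splits))
    # Final aggregation happens directly on the in-memory partial results.
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_count, total_sum


# Hypothetical input: ten rows alternating between two regions.
rows = [{"region": "east" if i % 2 == 0 else "west", "amount": i} for i in range(10)]
result = coordinator(rows)  # (matching row count, sum of amounts) for region "east"
```

The sketch is roughly equivalent to running SELECT COUNT(*), SUM(amount) ... WHERE region = 'east' with the scan spread across three workers.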

Connolly and Begg define a database management system (DBMS) as a “software system that enables users to define, create, maintain and control access to the database”.

The DBMS acronym is sometimes extended to indicate the underlying database model, with RDBMS for the relational model, OODBMS for the object (oriented) model and ORDBMS for the object-relational model. Other extensions can indicate some other characteristic, such as DDBMS for a distributed database management system.

The functionality provided by a DBMS can vary enormously. The core functionality is the storage, retrieval and update of data. Codd proposed the following functions and services that a fully fledged, general-purpose DBMS should provide:

  • Data storage, retrieval and update
  • User accessible catalog or data dictionary describing the metadata
  • Support for transactions and concurrency
  • Facilities for recovering the database should it become damaged
  • Support for authorization of access and update of data
  • Access support from remote locations
  • Enforcing constraints to ensure data in the database abides by certain rules

A single Presto query can combine data from multiple sources. Presto offers connectors to data sources including files in Alluxio, Hadoop Distributed File System, Amazon S3, MySQL, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Apache Kudu, Apache Phoenix, Apache Kafka, Apache Cassandra, Apache Accumulo, MongoDB and Redis. Unlike Hadoop distribution-specific tools such as Cloudera Impala, Presto can work with any flavor of Hadoop or without it. Presto supports separation of compute and storage and may be deployed both on premises and in the cloud.
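As a rough illustration of a query that spans two sources, the sketch below builds the SQL text for a join between a Hive table and a MySQL table. Presto addresses tables as catalog.schema.table; the catalog names (hive, mysql), the schemas, tables and columns here are all hypothetical and depend on how a given cluster is configured. The snippet only assembles the statement; submitting it would require a Presto client, which is not shown.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name in catalog.schema.table form."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical tables living in two different data sources.
page_views = qualified_name("hive", "web", "page_views")
customers = qualified_name("mysql", "crm", "customers")

# One SQL statement that joins data across both connectors.
FEDERATED_QUERY = f"""
SELECT c.country, COUNT(*) AS views
FROM {page_views} AS v
JOIN {customers} AS c ON v.customer_id = c.id
GROUP BY c.country
ORDER BY views DESC
"""
```

From the analyst's point of view this is an ordinary join; which connector serves each table is determined by the catalog prefix.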

The above is a brief about Presto. Watch this space for more updates on the latest trends in Technology.
