Introduction to Presto – Advanced Millennium Technologies

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB. One can even query data from multiple data sources within a single query. Presto is community-driven open-source software released under the Apache License.

SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e. data incorporating relations among entities and variables.

SQL offers two main advantages over older read-write APIs such as ISAM or VSAM. Firstly, it introduced the concept of accessing many records with a single command. Secondly, it eliminated the need to specify how to reach a record, e.g. with or without an index.
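
As a small illustration of both points, the sketch below uses Python's built-in sqlite3 module. SQLite is only a stand-in here, not Presto, and the `orders` table and its columns are hypothetical names made up for the example. A single UPDATE statement changes every matching record, and the same SELECT text returns the same answer whether or not an index exists.

```python
import sqlite3

# In-memory SQLite database; the `orders` table and its columns are
# hypothetical names chosen for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (region, status) VALUES (?, ?)",
    [("EU", "open"), ("EU", "open"), ("US", "open"), ("EU", "closed")],
)

# Point 1: one command touches many records at once.
cur = conn.execute(
    "UPDATE orders SET status = 'closed' WHERE region = 'EU' AND status = 'open'"
)
updated_rows = cur.rowcount

# Point 2: the query states what to fetch, not how to reach it.
# The same text is run before and after an index is created.
query = "SELECT COUNT(*) FROM orders WHERE region = 'EU'"
count_without_index = conn.execute(query).fetchone()[0]
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")
count_with_index = conn.execute(query).fetchone()[0]

print(updated_rows, count_without_index, count_with_index)
```

Whether the index is actually used is left to the database engine; the query author does not have to change anything.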

Presto was originally designed and developed at Facebook for its data analysts to run interactive queries on its large data warehouse in Apache Hadoop. Before Presto, the data analysts at Facebook relied on Apache Hive for running SQL analytics on their multi-petabyte data warehouse. Hive was inadequate for Facebook’s scale, and Presto was invented to fill the gap by running fast queries. Development started in 2012, and Presto was deployed at Facebook later that year. In November 2013, Facebook announced its release as open source. In 2014, Netflix disclosed that it used Presto on 10 petabytes of data stored in the Amazon Simple Storage Service (S3).

In January 2019, the Presto Software Foundation was announced. The foundation is a not-for-profit organization dedicated to the advancement of the Presto open-source distributed SQL query engine. Development of Presto continues independently, with PrestoDB maintained by Facebook and PrestoSQL maintained by the Presto Software Foundation, with some cross-pollination of code.

Presto's architecture is very similar to a classic database management system using cluster computing (MPP). It can be visualized as one coordinator node working in sync with multiple worker nodes. Clients submit SQL statements, which are parsed and planned; parallel tasks are then scheduled to the workers. Workers jointly process rows from the data sources and produce results that are returned to the client. Compared to the original Apache Hive execution model, which used the Hadoop MapReduce mechanism on each query, Presto does not write intermediate results to disk, resulting in a significant speed improvement. Presto is written in the Java programming language.
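
The coordinator and worker flow described above can be pictured with a toy sketch in Python. This is not Presto's implementation: threads stand in for worker nodes, a Python list stands in for a table, and every name here (`make_splits`, `worker_task`, `merge_partials`, `run_query`, and the `(region, amount)` rows) is hypothetical. It only shows the shape of the idea: split the input, aggregate each split in parallel in memory, then merge the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy rows standing in for a table; (region, amount) is a hypothetical schema.
ROWS = [("EU", 10), ("US", 5), ("EU", 7), ("APAC", 3), ("US", 8), ("EU", 1)]


def make_splits(rows, n_splits):
    """Coordinator step: divide the input into roughly equal splits."""
    return [rows[i::n_splits] for i in range(n_splits)]


def worker_task(split):
    """Worker step: partial SUM(amount) GROUP BY region, held in memory."""
    partial = {}
    for region, amount in split:
        partial[region] = partial.get(region, 0) + amount
    return partial


def merge_partials(partials):
    """Final step: combine the partial aggregates into one result."""
    final = {}
    for partial in partials:
        for region, total in partial.items():
            final[region] = final.get(region, 0) + total
    return final


def run_query(rows, n_workers=3):
    splits = make_splits(rows, n_workers)
    # Threads stand in for worker nodes; nothing is written to disk
    # between the partial and final aggregation steps.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_task, splits))
    return merge_partials(partials)


result = run_query(ROWS)
print(result)
```

The point of the sketch is that the partial aggregates are handed from one stage to the next in memory, which is the contrast with a MapReduce-style model drawn in the paragraph above.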

Connolly and Begg define a database management system (DBMS) as a “software system that enables users to define, create, maintain and control access to the database”.

The DBMS acronym is sometimes extended to indicate the underlying database model, with RDBMS for the relational model, OODBMS for the object (oriented) model and ORDBMS for the object-relational model. Other extensions can indicate some other characteristic, such as DDBMS for distributed database management systems.

The functionality provided by a DBMS can vary enormously. The core functionality is the storage, retrieval and update of data. Codd proposed the following functions and services a fully-fledged general-purpose DBMS should provide:

Data storage, retrieval and update
User accessible catalog or data dictionary describing the metadata
Support for transactions and concurrency
Facilities for recovering the database should it become damaged
Support for authorization of access and update of data
Access support from remote locations
Enforcing constraints to ensure data in the database abides by certain rules

A single Presto query can combine data from multiple sources. Presto offers connectors to data sources including files in Alluxio, Hadoop Distributed File System, Amazon S3, MySQL, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Apache Kudu, Apache Phoenix, Apache Kafka, Apache Cassandra, Apache Accumulo, MongoDB and Redis. Unlike other Hadoop distribution-specific tools, such as Cloudera Impala, Presto can work with any flavor of Hadoop or without it. Presto supports separation of compute and storage and may be deployed both on premises and in the cloud.
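
To give a feel for what one query over multiple sources looks like, here is a rough analogy using Python's built-in sqlite3 module. SQLite stands in for Presto, and two separate in-memory databases stand in for two different data sources; the names `crm`, `page_views`, and `users` are hypothetical. A single SELECT joins a table from each database.

```python
import sqlite3

# The main database plays the role of one source (say, a log store) and
# the attached database plays another (say, a user database).
# All database, table, and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.page_views (user_id INTEGER, url TEXT)")
conn.execute("CREATE TABLE crm.users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO main.page_views VALUES (?, ?)",
    [(1, "/home"), (1, "/docs"), (2, "/home")],
)
conn.executemany(
    "INSERT INTO crm.users VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# One query, two sources: join page views with user names.
rows = conn.execute(
    """
    SELECT u.name, COUNT(*) AS views
    FROM main.page_views AS v
    JOIN crm.users AS u ON u.id = v.user_id
    GROUP BY u.name
    ORDER BY u.name
    """
).fetchall()
print(rows)
```

In Presto, the table references would instead take the catalog.schema.table form, with each catalog backed by one of the connectors listed above.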

The above is a brief introduction to Presto. Watch this space for more updates on the latest trends in technology.
