
Announcing support for GROUP BY, SUM, and other aggregation queries in R2 SQL
When you’re dealing with large amounts of data, it’s helpful to get a quick overview - which is exactly what aggregations provide in SQL. Aggregations, known as “GROUP BY queries”, provide a bird’s eye view, so you can quickly gain insights from vast volumes of data. That’s why we are excited to announce support for aggregations in R2 SQL , Cloudflare's serverless, distributed, analytics query engine, which is capable of running SQL queries over data stored in R2 Data Catalog . Aggregations will allow users of R2 SQL to spot important trends and changes in the data, generate reports and find anomalies in logs. This release builds on the already supported filter queries, which are foundational for analytical workloads, and allow users to find needles in haystacks of Apache Parquet files. In this post, we’ll unpack the utility and quirks of aggregations, and then dive into how we extended R2 SQL to support running such queries over vast amounts of data stored in R2 Data Catalog. The importance of aggregations in analytics Aggregations, or “GROUP BY queries”, generate a short summary of the underlying data. A common use case for aggregations is generating reports. Consider a table called “sales”, which contains historical data of all sales across various countries and departments of some organisation. One could easily generate a report on the volume of sales by department using this aggregation query: SELECT department, sum(value) FROM sales GROUP BY department The “GROUP BY” statement allows us to split table rows into buckets. Each bucket has a label corresponding to a particular department. Once the buckets are full, we can then calculate “sum(value)” for all rows in each bucket, giving us the total volume of sales performed by the corresponding department. For some reports, we might only be interested in departments that had the largest volume. That’s where an “ORDER BY” statement comes in handy: SELECT department, sum(value) FROM sales GROUP BY department ORDER BY sum(value) DESC LIMIT 10 Here we instruct the query engine to sort all department buckets by their total sales volume in the descending order and only return the top 10 largest. Finally, we might be interested in filtering out anomalies. For example, we might want to only include departments that had more than five sales total in our report. We can easily do that with a “HAVING” statement: SELECT department, sum(value), count(*) FROM sales GROUP BY department HAVING count(*) > 5 ORDER BY sum(value) DESC LIMIT 10 Here we added a new aggregate function to our query - “count(*)” - which calculates how many rows ended up in each bucket. This directly corresponds to the number of sales in each department, so we have also added a predicate in the “HAVING” statement to make sure that we only leave buckets with more than five rows in them. Two approaches to aggregation: compute sooner or later Aggregation queries have a curious property: they can reference columns that are not stored anywhere. Consider “sum(value)”: this column is computed by...
Preview: ~500 words
Continue reading at Cloudflare
Read Full Article