Lateral joins on streaming SQL infrastructure like Materialize allow you to do some pretty amazing things. If 100,000 people want to track the top three posts of some common author, that result is determined and maintained just once.īecause you wrote it as SQL, rather than by hand. If 10 people want to start monitoring the top three cities in California, we’ll determine those three cities only once, and then join on the query identifiers to the results. The subquery is computed only once for each distinct parameter binding. In addition to doing more without being wildly complicated, this query plan shows off a really neat feature of lateral joins (and correlated subqueries). The new steps, %0 and %4 are pretty easy to explain: %0 exists because we need one instance of queries that retains the id column, and %4 exists to join queries against the results of the lateral join and recover the id to result association. Steps %1, %2 and %3 should look familiar they are the same as in the plan just above. | | implementation = Differential %3 %0.(#0) | | implementation = Differential %2 %1.(#0) Instead of a collection queries of state names, let’s imagine that you have distinct identifiers for each query, and the state name is just the associated data. However, the unused city data are all poised and ready the moment new query records show up a new record in queries would cause the join to produce the corresponding city records, and the TopK to update with the corresponding top cities for that state. The TopK operator, which actually expands out into a sequence of 16 differential dataflow Reduce operators, is spared all of that city data that isn’t required. It’s worth stressing again that this query plan holds back the cities for states that aren’t present in queries. ![]() The real work is in step %2, where we first join the queries and cities collections to extract the cities of interest, and then feed the results in to the TopK operator. Steps %0 and %1 are about naming and preparing the join inputs. | TopK group=(#0) order=(#3 desc) limit=3 offset=0 | | implementation = Differential %1 %0.(#0) Materialize has this neat EXPLAIN command that helps out when you want to inspect the plan we’ve produced for your query. To start, let’s check out the actual query plan for our query-driven lateral join. ![]() Let’s dive in to the inner workings of lateral joins, and check out their behavior on larger datasets. It has a bit of a different take on goals (a fast, eventually consistent, read cache), but its behavior is similar to using Materialize with lateral joins. The project lets you write SQL, use prepared statements, and it will handle populating the dataflow with the records relevant to the queries. Quick shout out to Noria, a super-interesting research project at MIT. If your goal is to build an application that needs to respond to thousands of parameterized queries each second, lateral joins are a great way to automatically turn your SQL prepared statements into high-throughput, data-driven, maintained views. Plus it ends up producing an output stream that not only reports answers but also monitors the changes to the query results for each parameter binding, until the binding is uninstalled. This is a high-throughput take on prepared statements, where many users can submit many concurrent parameter bindings, all on the data plane rather than control plane. In contrast, Materialize can get started on the query execution for these queries, building a dataflow that is ready to respond to a stream of parameter bindings for the statements. The RDBMS can do some amount of work even without you having yet specified the full query, but that work is mostly restricted to optimization. The closest connection is probably to prepared statements, which are ways to write queries with “holes” in them. This is pretty different from your standard RDBMS, in which queries happen once, and you don’t get to tweak their inputs live as they run. You can use live data to enable or disable incremental view maintenance, at the granularity of records. This is really important for queries like the above, because TopK is one of the relatively more expensive operators to maintain. The states we don’t ask about never reach the TopK operator. It only does the work for the states we ask for. The top_3s query does not compute the top three cities for each state and then hand out the results we ask for. This last part is subtle, and we’ll go in to a bit more detail in a bit. ![]() ![]() Whoever controls the contents of queries determines how much work we have to do as the data change. And generally, we will maintain the top three cities for any state added to the input, but not for any others. Materialize = > INSERT INTO queries VALUES ( 'CA' ) INSERT 0 1įor as long as the input contains 'CA' we will maintain the top three cities in California.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |