![]() It hides it and marks it to be deleted when you next vacuum. Take note of what UPDATE does.ĭelete doesn't delete the row. □ Avoid Delete if you can drop/truncate. even though all three of the 'when' conditions are true. □ CASE When statements take the first matchĬase when statements will take the first 'when' statement that is true, even if there are multiple that are true. So they can't distribute that query out to the other nodes sadly. Querying the system tables for logs or usage history can be slow as all of that stuff is taken care of by the leader node only. □ Query's against system tables can be slow □ Some data types have a base size you can't reduce.įor example varchar(4) takes up 8 bytes of space even though you have specified 4. Over time you will learn what Functions cost more This will forever be a balancing act of course. For example where invoicedate >= date_trunc('month',sysdate) has a higher performance cost than invoicedate >= "". Using functions can slow down performance. Redshift has no choice but to do a nested loop which means every SINGLE row in table a has to be checked against every row in table b, which can have massive amounts of overhead. ![]() When we join tables on conditional logic, i.e join table on blah > blah or join table on blah between bleh and bloh. □ Conditional Logic on Joins performs the worst □ Comparison Operators > LIKE > Similar to > REGEX/POSIXĪpplying Logic to your dataset comes at a cost in terms of performance.Ĭomparison Operators such as = perform better than LIKE, which in turn perform better than Similar To.Īnd anything and everything performs better than REGEX / POSIX If the data types don't match, Redshift will throw an error, but if they do match, you won't notice. So it will try and put customer_type data into the createddate column. Redshift will attempt to insert it in that order. ![]() If you build a table with the columns in this order:īut your insert statement is in this order: □ Redshift Inserts columns in the order you specify - even if they don't line up AZ64 should be used on your numbers, ZSTD on the rest.Ī series of things to be aware of, most of them have come from my own experience. LZO's best of all worlds compression has been replaced by ZSTD and AZ64 who do a better job. ❌ Don't use LZO, when you can use ZSTD or AZ64 ❌ Don't use Dist Style All on very large or small tables.Ībout 500k-1m is the sweet spot for Dist Style ALL, remember it gets copied to every node. Not implementing an Interleaved sort key can result in very poor result return time and long write / update / vacuum wait times. Interleaved Sort keys are complicated, only use them if you know what you're doing, by default use compound sort keys. ❌ Don't use an Interleaved Sort key unless you 100% know what your doing UNION is believed to perform ~150% worse than UNION ALL. Use UNION ALL instead and if you need to remove duplicate rows look at other methods to do so like a row_number and delete statement. When you use UNION, Redshift tries to remove any duplicate rows, so depending on the size of your data the performance overhead could be huge. This would vastly reduce the benefits of having a dist key. If we used a timestamp as our dist key, that would potentially lead to 86,000 unique dist keys PER day. ❌ Don't use highly unique columns as a Dist Keyįor Example using a timestamp for a dist key would be bad. The reason why the join condition is bad is that one data set is huge the other is small and by JOINING the data you will force data to be redistributed across nodes. The example below shows a good use of sub-query over a join. Sub-queries perform best over JOINS where its a simple IN clause. ❌ Don't use sub-queries for large complex operationsĪvoid using sub-queries on data sets that have multiple conditions and are large in size. Referencing the same table in a query can come at a high performance cost, explore other options like breaking down the query into smaller datasets or use a CASE expression. ❌ Don't join the same table multiple times in the same query If the column you set as your dist key has a lot of NULL values, then all the NULLS will end up on one slice. ❌ Don't assign a column with NULL values as a Dist key Redshift works faster the fewer columns are pulled in. HOWEVER, for everything else you should never be doing select * from unless you absolutely NEED every column. Redshift has a dedicated resource stream for handling small queries, so this rule doesn't apply to you if you are just wanting to do a quick select * from table where limit 50 as your query will be given its own resources. ❌ Don't select * unless it's a small query Enter fullscreen mode Exit fullscreen mode
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |