COPY data from multiple, evenly sized files

Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.

When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. In the example shown below, a single large file is loaded into a two-node cluster, resulting in only one of the nodes, “Compute-0”, performing all the data ingestion:

[Figure: a single large file loaded into a two-node cluster, with “Compute-0” performing all the data ingestion]

When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion, and using a single COPY command to bulk load data into a table ensures optimal use of cluster resources and the quickest possible throughput.

Use workload management to improve ETL runtimes

Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up. I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster. When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:
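As one illustration of a queue setup along these lines, here is a sketch of a WLM parameter-group JSON fragment with separate ETL and reporting queues and a total concurrency of 15. The queue names, concurrency values, and memory splits are illustrative choices of mine, not from the guide:

```json
[
  {
    "query_group": ["etl"],
    "query_concurrency": 5,
    "memory_percent_to_use": 60
  },
  {
    "query_group": ["report"],
    "query_concurrency": 8,
    "memory_percent_to_use": 30
  },
  {
    "query_concurrency": 2,
    "memory_percent_to_use": 10
  }
]
```

The last entry is the default queue, which catches queries that are not routed to a named query group.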
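The single-COPY guidance can be sketched as follows; the bucket, key prefix, table name, and IAM role are placeholders, not from the article:

```sql
-- One COPY for the whole table: Redshift assigns the gzip-compressed
-- files under the prefix to slices and ingests them in parallel.
COPY sales
FROM 's3://my-bucket/sales/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;
```

A common prefix (or a manifest file) lets one COPY command pick up all the split files, rather than issuing one COPY per file.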
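The file-splitting arithmetic (a file count that is a multiple of the slice count, with each file between 1 MB and 1 GB after compression) can be sketched in Python; the helper name and parameters are mine, for illustration only:

```python
def plan_split(total_compressed_mb, slices_per_node, nodes,
               min_mb=1, max_mb=1024):
    """Pick a file count that is a multiple of the total slice count
    and keeps each file between min_mb and max_mb after compression."""
    slices = slices_per_node * nodes
    # Start with one file per slice, then add whole "rounds" of files
    # until each file fits under the 1 GB ceiling.
    files = slices
    while total_compressed_mb / files > max_mb:
        files += slices
    if total_compressed_mb / files < min_mb:
        raise ValueError("files would be under 1 MB; use fewer files")
    return files

# Two DS2.XLARGE nodes -> 2 slices each = 4 slices total.
# 10 GB compressed splits into 12 files (~853 MB each), a multiple of 4.
print(plan_split(10240, slices_per_node=2, nodes=2))  # -> 12
```

Every slice then receives the same number of files of roughly equal size, so no slice becomes the bottleneck.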