Partitioning in DataStage

Hey, hope you are doing well.

In this article we will look at the different partitioning and collecting techniques, which are an essential part of parallel processing, along with some examples.

Before starting with these techniques, I hope we are all familiar with parallel processing. In short, parallel processing is the biggest advantage of DataStage when we compare it with similar tools like Informatica. It divides the data into small segments, and each segment is processed by its own node (the number of nodes is defined in the configuration file) in parallel, which results in faster processing of the data.

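Basically, there are two types of partitioning in DataStage:

  1. Keyless Partitioning
  2. Key-Based Partitioning

As the names suggest, key-based partitioning decides where a row goes from the values in the key column(s) you define, while keyless partitioning does not look at the column values at all. Of the methods covered below, Hash is key-based; Entire, Random, Round Robin and Same are keyless.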

The partitioning techniques that can be used in parallel stages are explained below, with examples.

  • Auto –

In this method DataStage decides how the data is partitioned. It chooses the best partitioning method depending on the execution mode of the current stage and the preceding stage, and on the number of nodes defined in the configuration file.

  • Entire –

In this method of partitioning every node receives the whole input data, i.e. the input data is replicated on each node. This method comes in handy when the Lookup stage comes into the picture: it is the default partitioning method for the reference link of a Lookup stage, so that every node can look up any key locally. Outside of lookups it is used less often, because replicating the data on every node takes a lot of memory.
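To make the memory cost concrete, here is a minimal Python sketch of what Entire does. This is only a model of the idea, not DataStage code; the function name `entire_partition` and the sample rows are made up for illustration.

```python
def entire_partition(rows, num_nodes):
    """Give every node its own full copy of the input rows."""
    return [list(rows) for _ in range(num_nodes)]


# Hypothetical reference data for a lookup: (code, city name)
reference = [("NYC", "New York"), ("LON", "London")]

partitions = entire_partition(reference, num_nodes=4)

# Every node holds all rows, so the total held is rows * nodes.
total_rows_held = sum(len(p) for p in partitions)
print(total_rows_held)  # 8 rows held in total for a 2-row input
```

With 4 nodes a 2-row input becomes 8 rows in memory, which is why this method suits small reference data rather than large inputs.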

  • Hash –

This is a frequently used partitioning method. Here the data is divided across the nodes depending on the values in the key column. Rows with the same key column value always go to the same partition. It does not guarantee an equal distribution of data across the partitions, which may result in poor performance.

Reason: Consider a key column that contains city names, where a large number of records come from one or two cities. Those records all land in the same one or two partitions, so some nodes have to process far more records than the others.

(Figure: Hash partitioning)
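The city example can be sketched in a few lines of Python. This is an illustration of the idea only: `hash_partition` is a made-up name, and `zlib.crc32` is just a convenient stand-in for whatever hash function DataStage uses internally.

```python
import zlib


def hash_partition(rows, key, num_nodes):
    """Send each row to the node chosen by hashing its key column value."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # crc32 is a stand-in hash, chosen because it is stable across runs.
        node = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        partitions[node].append(row)
    return partitions


# Skewed sample data: 8 of the 10 records come from one city.
rows = (
    [{"city": "Mumbai"} for _ in range(8)]
    + [{"city": "Pune"}]
    + [{"city": "Delhi"}]
)

partitions = hash_partition(rows, key="city", num_nodes=3)
sizes = [len(p) for p in partitions]
print(sizes)  # one node holds at least 8 of the 10 rows
```

All "Mumbai" rows hash to the same value, so they end up on one node no matter how many nodes there are. Adding nodes does not fix this kind of skew; only a better-distributed key does.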

  • Random –

As the name suggests, it distributes the data randomly across all the partitions, which gives partitions of approximately equal size. Random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.
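A rough Python model of random partitioning is shown below. The function name and the `seed` parameter are assumptions for the sake of a repeatable example; DataStage does not expose it this way.

```python
import random


def random_partition(rows, num_nodes, seed=None):
    """Assign each row to a randomly chosen node."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # One random draw per record: this is the extra per-row cost
        # compared with round robin, which only increments a counter.
        partitions[rng.randrange(num_nodes)].append(row)
    return partitions


partitions = random_partition(range(10_000), num_nodes=4, seed=42)
sizes = [len(p) for p in partitions]
print(sizes)  # four sizes, each close to 2500 but not exactly equal
```

The sizes come out close to each other but not identical, which is what "approximately equal" means here.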


  • Round Robin –

This is a frequently used method as it is very efficient and ensures the data is divided evenly across all the nodes. The 1st record goes to the first processing node, the 2nd record to the second processing node, and so on; when it reaches the last node it starts over from the first.

This method gives an almost exact load balance between nodes (partition sizes differ by at most one record) and is very fast.

(Figure: Round robin partitioning)
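The dealing-out of records can be sketched in Python as follows. Again this is only a model of the behaviour; `round_robin_partition` is a hypothetical name.

```python
def round_robin_partition(rows, num_nodes):
    """Deal rows out to the nodes one at a time, wrapping around at the end."""
    partitions = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        partitions[i % num_nodes].append(row)
    return partitions


# Ten records numbered 1..10, dealt across 3 nodes.
partitions = round_robin_partition(list(range(1, 11)), num_nodes=3)
print(partitions)  # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
```

Ten records do not divide evenly by three, so the first node gets one extra record. That is the worst case: the sizes never differ by more than one.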

 

  • Same –

This is also a frequently used partitioning method. It keeps the partitioning that was done in the previous stage, i.e. it doesn't move data across the nodes again if the data was already partitioned in an earlier stage. This saves a great amount of time and results in faster execution.
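To see what "doesn't move data across the nodes" saves, the sketch below counts how many rows change node under Same compared with repartitioning round robin. All function names are made up, and the flatten-and-redeal repartitioning is a simplification rather than how DataStage actually moves data.

```python
def same_partition(partitions):
    """Keep every row on the node it is already on."""
    return [list(p) for p in partitions]


def round_robin_repartition(partitions):
    """Simplified repartitioning: flatten the rows and deal them out again."""
    num_nodes = len(partitions)
    flat = [row for part in partitions for row in part]
    out = [[] for _ in range(num_nodes)]
    for i, row in enumerate(flat):
        out[i % num_nodes].append(row)
    return out


def rows_moved(before, after):
    """Count rows whose node index differs between the two layouts."""
    node_before = {row: n for n, part in enumerate(before) for row in part}
    return sum(
        1 for n, part in enumerate(after) for row in part if node_before[row] != n
    )


# Output of an earlier stage, already split across 2 nodes.
upstream = [[1, 2, 3], [4, 5, 6]]

moved_same = rows_moved(upstream, same_partition(upstream))
moved_rr = rows_moved(upstream, round_robin_repartition(upstream))
print(moved_same, moved_rr)  # 0 2
```

Every moved row is data that has to travel between nodes, so zero moved rows is the cheapest possible outcome.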

That’s it for partitioning. There are also collecting methods, which work in exactly the opposite way: they gather the data that was divided across the nodes back into a single stream.

Below are the methods DataStage supports for collecting data.

  1. Auto
  2. Round robin
  3. Ordered
  4. Sort Merge
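Three of these collectors can be modelled in Python to show how the output order differs. This is a sketch under assumptions: the function names are invented, each partition is a plain list, and Auto is left out because it simply takes rows from whichever partition has one ready, which depends on timing.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # marks a partition that has run out of rows


def ordered_collect(partitions):
    """Read all rows from the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def round_robin_collect(partitions):
    """Read one row from each partition in turn, skipping exhausted ones."""
    out = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        out.extend(row for row in group if row is not _MISSING)
    return out


def sort_merge_collect(partitions, key=None):
    """Merge partitions that are each already sorted on the key."""
    return list(heapq.merge(*partitions, key=key))


# Three partitions, each already sorted within itself.
partitions = [[1, 2, 9], [3, 4], [5, 6, 7]]

print(ordered_collect(partitions))      # [1, 2, 9, 3, 4, 5, 6, 7]
print(round_robin_collect(partitions))  # [1, 3, 5, 2, 4, 6, 9, 7]
print(sort_merge_collect(partitions))   # [1, 2, 3, 4, 5, 6, 7, 9]
```

Only the sort merge sketch produces a fully sorted result, and only because every input partition was sorted to begin with.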

 



© 2017 Database ETL. All rights reserved.