
DataFlux Split Data and Cluster Best Practice

New Contributor
Posts: 2

Hi All,

 

This is my first post under Data Management.

I have a scenario that needs to be addressed using DataFlux:

 

Say I have four fields: Roll, Name, Marks and Flag. I have the following source records (Marks is blank for Roll 1):

Roll  Name  Marks  Flag
1     A            1
2     A     100    0
3     A     99     1
4     A     111    0
5     A     109    1
6     A     19     0

 

1) Before clustering, how can I split this single table into two based on Flag?

2) In the clustered output, how can I add and update an additional field for each source field, holding a flag that says whether the corresponding field value was considered for the single view?

 

Thanks,
Sandeep

 

SAS Super FREQ
Posts: 97

Re: Dataflux Split data and Cluster best Practice

Posted in reply to sanaliticscap

Hi Sandeep,

 

You can use a Branch node followed by two Data Validation nodes. In one Data Validation node, write an expression where FLAG=0 and in the other, an expression where FLAG=1. You'll end up with two branches in your job.

 

I'm not sure I follow your second question, but I'll try to answer. You can use an Expression node to create new variables (a synonym for fields or columns) in your data flows. If you need logic that scans clusters to populate values in your new field, look at the Grouping tab of the Expression node. There you can initialize variables each time your code hits a new cluster, so you can use this technique to do counts or set flags. Make sure you sort by cluster number first and use the cluster number as your "group by" variable.
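As a rough illustration of that grouping idea, here is a sketch in plain Python rather than Expression node code. The cluster ids, the `flag_clusters` name and the `is_first` flag are all made up for the example; the point is only the pattern of sorting by cluster, resetting a variable at each new cluster, then counting and flagging.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical clustered rows: (cluster_id, roll, marks).
# The cluster ids are invented for illustration only.
rows = [(2, 4, 111), (1, 1, None), (1, 3, 99), (2, 2, 100), (1, 5, 109)]


def flag_clusters(rows):
    """Add a running count and a first-in-cluster flag to every row."""
    out = []
    # Sort by cluster number first, so each cluster is contiguous.
    ordered = sorted(rows, key=itemgetter(0))
    for cluster_id, members in groupby(ordered, key=itemgetter(0)):
        count = 0  # re-initialised each time a new cluster starts
        for cluster, roll, marks in members:
            count += 1
            is_first = 1 if count == 1 else 0
            out.append((cluster, roll, marks, count, is_first))
    return out


flagged = flag_clusters(rows)
```

The same shape should carry over to any per-cluster rule: replace the `is_first` test with whatever condition decides the flag.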

 

Ron

New Contributor
Posts: 2

Re: Dataflux Split data and Cluster best Practice

Posted in reply to RonAgresta

Hi Ron,

 

For the first one, I agree that branching plus Data Validation is one solution, but it reads the complete table twice during data validation, which increases the load. If there were a splitter kind of functionality, we could split the table in one step.
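To make the "one step" idea concrete, here is a minimal sketch in plain Python (for illustration only, not DataFlux code). It uses the source records from the first post; the `split_by_flag` name is made up. Each row is read once and routed to the list for its flag value.

```python
from collections import defaultdict

# Source records from the first post: (roll, name, marks, flag).
# Marks is blank for roll 1, so it is represented as None here.
records = [
    (1, "A", None, 1),
    (2, "A", 100, 0),
    (3, "A", 99, 1),
    (4, "A", 111, 0),
    (5, "A", 109, 1),
    (6, "A", 19, 0),
]


def split_by_flag(rows, flag_index=3):
    """Partition rows into one list per flag value in a single pass."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[flag_index]].append(row)
    return dict(parts)


parts = split_by_flag(records)
flag0_rows = parts[0]
flag1_rows = parts[1]
```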

 

The second question is about an easy way to link the clustered rows to the survival record, because the survival node would already have worked out which field value to select from which row. Also, an expression would not be practical, as we might have 50 or more flags for 50 fields.

 

Anyway, thank you for the response and for your contribution to the community. It helps us all.

 

Thanks,

sandeep

 
