Surviving Record in Dataflux 2.8- how to handle more then one value in...

beamer108 · Posted 04-28-2021 11:51 AM

Hi

I am working a marketing project where the goal is to de-dup member files using DataFlux 2.8. So if a row for a member appears more then once - I identify the surviving record. Now if the 'child' record(s) have additional information but is different on each row - is there a way to incorporate or add a reference field to the surviving record?

Example -

a member has 3 different records.
Surviving record contains no email address and a phone number (333-4112).
Child row 2 contains email address of gmail.com and phone number (333-4212)
Child row 3 contains email address of hotmail.com and phone number (333-4221).

For the best case scenario and from a marketing perspective - I want the Surviving record to contain or be able to reference the all emails and different phone numbers so I have every opportunity to get information out to this member.

Is this possible? And if so, how would I go about doing this?

Thanks in advance.

Reeza · Posted 04-28-2021 12:01 PM

Yes it's possible and there are multiple ways.

Transpose the emails and phone numbers into a single line per member. Your situation is going from a long to wide format, for multiple variables. A data step approach is faster but more complex from a coding perspective. The typical approach is to use two proc transposes, one for email, one for phone and then merge them. If you know that you'll have a max of 3 per person there may be some other merge options as well but a dynamic approach is definitely safer.

For a more useful answer, please post example data and the output you expect from the data otherwise, generic tutorials are below.

Transposing data tutorials:
Long to Wide:
https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-long-to-wide-using-proc-transpose/

https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/

Wide to Long:
https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-wide-to-long-using-proc-transpose/

https://stats.idre.ucla.edu/sas/modules/reshaping-data-wide-to-long-using-a-data-step/

And sometimes a double transpose is needed for extra wide data sets:
https://gist.github.com/statgeek/2321b6f62ab78d5bf2b0a5a8626bd7cd

@beamer108 wrote:

Hi

I am working a marketing project where the goal is to de-dup member files. So if a row for a member appears more then once - I identify the surviving record. Now if the 'child' record(s) have additional information but is different on each row - is there a way to incorporate or add a reference field to the surviving record?

Example -

a member has 3 different records.

Surviving record contains no email address and a phone number (333-4112).

Child row 2 contains email address of gmail.com and phone number (333-4212)

Child row 3 contains email address of hotmail.com and phone number (333-4221).

For the best case scenario and from a marketing perspective - I want the Surviving record to contain or be able to reference the all emails and different phone numbers so I have every opportunity to get information out to this member.

Is this possible? And if so, how would I go about doing this?

Thanks in advance.

beamer108 · Posted 04-28-2021 12:55 PM

Hi I should have initially specified this was in Dataflux 2.8.

Reeza · Posted 04-28-2021 02:44 PM

Probably would have helped. Pretty sure the solution is the same methodology anyways, assuming you replace the TRANSPOSE code with the TRANSPOSE task.

beamer108 · Posted 04-28-2021 03:27 PM

Im not sure how to apply this to Dataflux. Is it in an expression node that I do this or is it truly a task labeled Transpose that I haven't seen yet. If its an expression node - can you give me some guidance on what the EEL would look like?

Thanks.

Reeza · Posted 04-28-2021 03:40 PM

Sorry, think I mixed up DI Studio.
If you look at the blog post it will give you the idea of the methodology you need and you can implement it in Dataflux

SASKiwi · Posted 04-28-2021 05:04 PM

Why are you limiting yourself to do this only in DataFlux Data Management Studio? If you have the full SAS Data Quality Product then @Reeza 's approach can be used and it's an easier solution to implement.

VincentRejany · Posted 04-28-2021 05:10 PM

Don't want to start the discussion about SAS vs DataFlux approach ... but the challenge is a bit more complex than it seems, and cannot be really addressed with a proc transpose ... as the request requires to analyze each cluster in one stream of data and to repopulate the surviving record with the missing values. And for that particular purpose and for many others, I don't think that DataFlux DM Studio is "limited"

VincentRejany · Posted 04-28-2021 04:08 PM

Hi

Thanks for this little exercise which brings me back almost 10 years ago when I had to work on a similar challenge.

So how to do that? One option is to read each record in a cluster and load an array with all the possible values for one specific token. ie:

Record 1 - Email = xyz@gmail.com -> Array column 1 = xyz@gmail.com

Record 2 - Email = xyz@hotmail.com -> Array column 2 = xyz@hotmail.com

Record 3 - Email = xyz@hotmail.com -> no action

Record 4 - Email = xyz@sas.com -> Array column 3 = xyz@sas.com

and next, when ending with the survivor (yes better if the survivor is at the end of the cluster) ... then you can write back the emails into, by checking that the token is not in already.

look at the job attached, you just have to change the extension from xml to ddf and put it to your file repository.

The code is for sure pretty ugly ... but I am sure you will get the idea

beamer108 · Posted 05-07-2021 03:35 PM

Sorry VincentRejany for taking so long to get back to you. This worked perfectly and I was able to tweak it for 3 different nodes, looking at email, address and phone number, but honestly all i did was copy and paste the code. The reason I have been so long responding is because I am trying to pull the code apart to understand exactly what it is doing so I can learn and use it for future. Arrays are very new to me so this one is mind boggling. Nothing that I have googled has really helped explain it. Is there any tips or documentation you can steer me towards in understanding how you pulled this together, or arrays in general as they relate to Expression Language I would really appreciate. If this is not the forum for me to get that information from, i understand that as well. Thank you for this very very helpful information.

VincentRejany · Posted 05-10-2021 06:46 AM

Hi

No issue.

There is not really that much to get. The expression language is quite powerful for doing row by row operations, or through group of rows (clusters).

Arrays can of course 'live" over multiple rows, therefore you can add whatever into as a temp table.

https://documentation.sas.com/doc/en/engelref/2.8/p1hmnxhrls0dsen1o5ga29unjrxe.htm

https://documentation.sas.com/api/docsets/engelref/2.8/content/engelref.pdf?locale=en

Surviving Record in Dataflux 2.8- how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Re: Surviving Record - how to handle more then one value in child records

Ready to join fellow brilliant minds for the SAS Hackathon?