BookmarkSubscribeRSS Feed
beamer108
Quartz | Level 8

Hi

I am working a marketing project where the goal is to de-dup member files using DataFlux 2.8. So if a row for a member appears more then once - I identify the surviving record.  Now if the 'child' record(s) have additional information but is different on each row - is there a way to incorporate or add a reference field to the surviving record? 

Example -

  • a member has 3 different records.
  • Surviving record contains no email address and a phone number (333-4112). 
  • Child row 2 contains email address of gmail.com and phone number (333-4212)
  • Child row 3 contains email address of hotmail.com and phone number (333-4221).

For the best case scenario and from a marketing perspective - I want the Surviving record to contain or be able to reference the all emails and different phone numbers so I have every opportunity to get information out to this member.

Is this possible? And if so, how would I go about doing this?

Thanks in advance.

10 REPLIES 10
Reeza
Super User

Yes it's possible and there are multiple ways. 

 

Transpose the emails and phone numbers into a single line per member. Your situation is going from a long to wide format, for multiple variables. A data step approach is faster but more complex from a coding perspective. The typical approach is to use two proc transposes, one for email, one for phone and then merge them. If you know that you'll have a max of 3 per person there may be some other merge options as well but a dynamic approach is definitely safer. 

 

For a more useful answer, please post example data and the output you expect from the data otherwise, generic tutorials are below.

 

Transposing data tutorials:
Long to Wide:
https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-long-to-wide-using-proc-transpose/

https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/

Wide to Long:
https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-wide-to-long-using-proc-transpose/

https://stats.idre.ucla.edu/sas/modules/reshaping-data-wide-to-long-using-a-data-step/

And sometimes a double transpose is needed for extra wide data sets:
https://gist.github.com/statgeek/2321b6f62ab78d5bf2b0a5a8626bd7cd

 

 


@beamer108 wrote:

Hi

I am working a marketing project where the goal is to de-dup member files. So if a row for a member appears more then once - I identify the surviving record.  Now if the 'child' record(s) have additional information but is different on each row - is there a way to incorporate or add a reference field to the surviving record? 

Example -

  • a member has 3 different records.
  • Surviving record contains no email address and a phone number (333-4112). 
  • Child row 2 contains email address of gmail.com and phone number (333-4212)
  • Child row 3 contains email address of hotmail.com and phone number (333-4221).

For the best case scenario and from a marketing perspective - I want the Surviving record to contain or be able to reference the all emails and different phone numbers so I have every opportunity to get information out to this member.

Is this possible? And if so, how would I go about doing this?

Thanks in advance.


 

beamer108
Quartz | Level 8

Hi I should have initially specified this was in Dataflux 2.8.

Reeza
Super User
Probably would have helped. Pretty sure the solution is the same methodology anyways, assuming you replace the TRANSPOSE code with the TRANSPOSE task.
beamer108
Quartz | Level 8

Im not sure how to apply this to Dataflux. Is it in an expression node that I do this or is it truly a task labeled Transpose that I haven't seen yet.  If its an expression node - can you give me some guidance on what the EEL would look like?

Thanks.

Reeza
Super User
Sorry, think I mixed up DI Studio.
If you look at the blog post it will give you the idea of the methodology you need and you can implement it in Dataflux
SASKiwi
PROC Star

Why are you limiting yourself to do this only in DataFlux Data Management Studio? If you have the full SAS Data Quality Product then @Reeza 's approach can be used and it's an easier solution to implement.

VincentRejany
SAS Employee

Don't want to start the discussion about SAS vs DataFlux approach ... but the challenge is a bit more complex than it seems, and cannot be really addressed with a proc transpose ... as the request requires to analyze each cluster in one stream of data and to repopulate the surviving record with the missing values. And for that particular purpose and for many others, I don't think that DataFlux DM Studio is "limited"

VincentRejany
SAS Employee

Hi

Thanks for this little exercise which brings me back almost 10 years ago when I had to work on a similar challenge.

So how to do that? One option is to read each record in a cluster and load an array with all the possible values for one specific token. ie:

Record 1 - Email = xyz@gmail.com -> Array column 1 = xyz@gmail.com

Record 2 - Email = xyz@hotmail.com -> Array column 2 = xyz@hotmail.com

Record 3 - Email = xyz@hotmail.com -> no action

Record 4 - Email = xyz@sas.com -> Array column 3 = xyz@sas.com

and next, when ending with the survivor (yes better if the survivor is at the end of the cluster) ... then you can write back the emails into, by checking that the token is not in already.

look at the job attached, you just have to change the extension from xml to ddf and put it to your file repository.

The code is for sure pretty ugly ... but I am sure you will get the idea

VincentRejany_0-1619640471334.png

 

 

beamer108
Quartz | Level 8

Sorry VincentRejany for taking so long to get back to you.  This worked perfectly and I was able to tweak it for 3 different nodes, looking at email, address and phone number, but honestly all i did was copy and paste the code. The reason I have been so long responding is because I am trying to pull the code apart to understand exactly what it is doing so I can learn and use it for future.  Arrays are very new to me so this one is mind boggling. Nothing that I have googled has really helped explain it.  Is there any tips or documentation you can steer me towards in understanding how you pulled this together, or arrays in general as they relate to Expression Language I would really appreciate.  If this is not the forum for me to get that information from, i understand that as well.  Thank you for this very very helpful information. 

VincentRejany
SAS Employee

Hi

 

No issue.

There is not really that much to get. The expression language is quite powerful for doing row by row operations, or through group of rows (clusters).

Arrays can of course 'live" over multiple rows, therefore you can add whatever into as a temp table.

https://documentation.sas.com/doc/en/engelref/2.8/p1hmnxhrls0dsen1o5ga29unjrxe.htm

https://documentation.sas.com/api/docsets/engelref/2.8/content/engelref.pdf?locale=en

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 1568 views
  • 3 likes
  • 4 in conversation