SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

How to make sure everybody gets everything?

Accepted Solution Solved
Reply
Contributor
Posts: 58
Accepted Solution

How to make sure everybody gets everything?

[ Edited ]

I have ca 50 datasets, that can be  categorised into 3 dataset types I could say

the main joins are: 

 

A: Refer_ID

 

B:Refer_ID  Cl_ID

 

C:  CL_ID Cas_ID

 

I am wanting to make sure every dataset has all 3 variables: Refer_ID, CL_ID , Cas_ID to make the processing later on easier. \

 

There are never the same number of Refer_ID, CL_ID or Cas_ID,

Because there are datasets where Refer_ID has more or less CL_ID and there are also different numbers of CAS_ID depending on the dataset.

 

I am ‘merging' datasets with INNER JOINS, to make sure that every dataset always has all three variables: Refer_ID CL_ID Cas_ID

But they don't end up with the same amount of datapoints at the end. Should they not logically all have then the same amount of datapoints?

How can I improve on my inner join statements?  I am wondering, whether it should be "merged' based on two variables if possible and not only based on one (

on A.CL_ID = B.CL_ID ;)

?

Ideas are most welcome

 

 

 

*/ INNER JOIN - Kea.trails_file_7  */;  
Proc sql;  
Create table Kea.trail7A as   
Select A. * , B.REFER_ID 
from Kea.trails_file_7 A  inner join Kea.REFERENCE2 B
 on A.CL_ID = B.CL_ID ;    
quit;

 

In the example below: 

trails_file_7 has CL_ID  and I use Reference2 to attach  Refer_ID, CL_ID , Cas_ID to trails7 - I want to make sure that trails 7 has all three variables. 

 


Accepted Solutions
Solution
‎07-03-2017 10:11 PM
Contributor
Posts: 58

Re: How to make sure everybody gets everything?

yes

I started from scratch and joined all that have the REFER_ID and Cl_ID - to create REFERENCE 1

Then all with  CL_ID  and CAS_ID- to create REFERENCE 2

Join REFERENCE 1 & REFERENCE 2 to create a reference dataset for all REFERENCE_ALL with REFER_ID and Cl_IDand CAS_ID

then join every single dataset with this REFERENCE_ALL. 

 

Maybe that is a better way

View solution in original post


All Replies
Super User
Posts: 17,801

Re: How to make sure everybody gets everything?

What uniquely identifies a record/person?

 

Contributor
Posts: 58

Re: How to make sure everybody gets everything?

At the end it will be the REFER_ID, but not all datasets have the refer_id. Therefore I have to find ways to include it, and I thought that

 

datasets with only CL_ID could be merged with datasets that have CL_ID and REFER_ID

datasets with only CAS_ID could be merged with datasets that have CL_ID and CAS_ID and

datasets with CAS_ID and  CL_ID could be merged with datasets that have CL_ID or CAS_ID in combination with REFER_ID

 

Therefore I have built a reference dataset with    REFER_ID,  CL_ID and CAS_ID 

and this would then be used to give every dataset a column with REFER_ID. To build that reference dataset, I used the dataset that will build the spine. Therefore, this dataset should have all the REFER_IDs needed. 

 

 

Is that line of thought correct? And how do I make sure my otherdatasets get all these REFER_ID attached to them ?

Many thanks

Super User
Posts: 17,801

Re: How to make sure everybody gets everything?


AnnaNZ wrote:

 

 

Is that line of thought correct? And how do I make sure my otherdatasets get all these REFER_ID attached to them ?

Many thanks



This is known as creating a master 'person' list and is commonly done in data integration projects. So yes, this is the correct approach and how I'd work with it. What you do is create custom formats that will map CAS_ID and CL_ID to REFER_ID in this. But make absolutely sure that it's a one to one relationship, it may not be. For example can a REFER_ID have more than one CAS_ID or CL_ID? Are REFER_ID's unique to a person?

 

 

Contributor
Posts: 58

Re: How to make sure everybody gets everything?

[ Edited ]

   - Are REFER_ID's unique to a person?

Yes they are unique to one person. But one person can have several Clients: CL_ID 

and a CL_ID can have several Cases: CAS_ID

 

So what to do about that? I systematecally dropped all duplicates after generating a new dataset: 

for example: 
 proc sort data=Kea.trail8A noduprecs; 
      by _all_ ; Run; 

 and then joined it with the next set. Is that a sensible way to do it? eventually I should have everywhere the same N ? Right?

Many thanks

Super User
Posts: 17,801

Re: How to make sure everybody gets everything?

Should they all have the same number of records? I can't say to be honest. It depends more on your data than anything else, but in the end aren't you expecting a single master data set anyways, not multiples.

 

 

Solution
‎07-03-2017 10:11 PM
Contributor
Posts: 58

Re: How to make sure everybody gets everything?

yes

I started from scratch and joined all that have the REFER_ID and Cl_ID - to create REFERENCE 1

Then all with  CL_ID  and CAS_ID- to create REFERENCE 2

Join REFERENCE 1 & REFERENCE 2 to create a reference dataset for all REFERENCE_ALL with REFER_ID and Cl_IDand CAS_ID

then join every single dataset with this REFERENCE_ALL. 

 

Maybe that is a better way

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 218 views
  • 0 likes
  • 2 in conversation