BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AnnaNZ
Quartz | Level 8

I have ca 50 datasets, that can be  categorised into 3 dataset types I could say

the main joins are: 

 

A: Refer_ID

 

B:Refer_ID  Cl_ID

 

C:  CL_ID Cas_ID

 

I am wanting to make sure every dataset has all 3 variables: Refer_ID, CL_ID , Cas_ID to make the processing later on easier. \

 

There are never the same number of Refer_ID, CL_ID or Cas_ID,

Because there are datasets where Refer_ID has more or less CL_ID and there are also different numbers of CAS_ID depending on the dataset.

 

I am ‘merging' datasets with INNER JOINS, to make sure that every dataset always has all three variables: Refer_ID CL_ID Cas_ID

But they don't end up with the same amount of datapoints at the end. Should they not logically all have then the same amount of datapoints?

How can I improve on my inner join statements?  I am wondering, whether it should be "merged' based on two variables if possible and not only based on one (

on A.CL_ID = B.CL_ID ;)

?

Ideas are most welcome

 

 

 

*/ INNER JOIN - Kea.trails_file_7  */;  
Proc sql;  
Create table Kea.trail7A as   
Select A. * , B.REFER_ID 
from Kea.trails_file_7 A  inner join Kea.REFERENCE2 B
 on A.CL_ID = B.CL_ID ;    
quit;

 

In the example below: 

trails_file_7 has CL_ID  and I use Reference2 to attach  Refer_ID, CL_ID , Cas_ID to trails7 - I want to make sure that trails 7 has all three variables. 

 

1 ACCEPTED SOLUTION

Accepted Solutions
AnnaNZ
Quartz | Level 8

yes

I started from scratch and joined all that have the REFER_ID and Cl_ID - to create REFERENCE 1

Then all with  CL_ID  and CAS_ID- to create REFERENCE 2

Join REFERENCE 1 & REFERENCE 2 to create a reference dataset for all REFERENCE_ALL with REFER_ID and Cl_IDand CAS_ID

then join every single dataset with this REFERENCE_ALL. 

 

Maybe that is a better way

View solution in original post

6 REPLIES 6
Reeza
Super User

What uniquely identifies a record/person?

 

AnnaNZ
Quartz | Level 8

At the end it will be the REFER_ID, but not all datasets have the refer_id. Therefore I have to find ways to include it, and I thought that

 

datasets with only CL_ID could be merged with datasets that have CL_ID and REFER_ID

datasets with only CAS_ID could be merged with datasets that have CL_ID and CAS_ID and

datasets with CAS_ID and  CL_ID could be merged with datasets that have CL_ID or CAS_ID in combination with REFER_ID

 

Therefore I have built a reference dataset with    REFER_ID,  CL_ID and CAS_ID 

and this would then be used to give every dataset a column with REFER_ID. To build that reference dataset, I used the dataset that will build the spine. Therefore, this dataset should have all the REFER_IDs needed. 

 

 

Is that line of thought correct? And how do I make sure my otherdatasets get all these REFER_ID attached to them ?

Many thanks

Reeza
Super User

@AnnaNZ wrote:

 

 

Is that line of thought correct? And how do I make sure my otherdatasets get all these REFER_ID attached to them ?

Many thanks



This is known as creating a master 'person' list and is commonly done in data integration projects. So yes, this is the correct approach and how I'd work with it. What you do is create custom formats that will map CAS_ID and CL_ID to REFER_ID in this. But make absolutely sure that it's a one to one relationship, it may not be. For example can a REFER_ID have more than one CAS_ID or CL_ID? Are REFER_ID's unique to a person?

 

 

AnnaNZ
Quartz | Level 8

   - Are REFER_ID's unique to a person?

Yes they are unique to one person. But one person can have several Clients: CL_ID 

and a CL_ID can have several Cases: CAS_ID

 

So what to do about that? I systematecally dropped all duplicates after generating a new dataset: 

for example: 
 proc sort data=Kea.trail8A noduprecs; 
      by _all_ ; Run; 

 and then joined it with the next set. Is that a sensible way to do it? eventually I should have everywhere the same N ? Right?

Many thanks

Reeza
Super User

Should they all have the same number of records? I can't say to be honest. It depends more on your data than anything else, but in the end aren't you expecting a single master data set anyways, not multiples.

 

 

AnnaNZ
Quartz | Level 8

yes

I started from scratch and joined all that have the REFER_ID and Cl_ID - to create REFERENCE 1

Then all with  CL_ID  and CAS_ID- to create REFERENCE 2

Join REFERENCE 1 & REFERENCE 2 to create a reference dataset for all REFERENCE_ALL with REFER_ID and Cl_IDand CAS_ID

then join every single dataset with this REFERENCE_ALL. 

 

Maybe that is a better way

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1027 views
  • 0 likes
  • 2 in conversation