Solved: Re: How to make sure everybody gets everything?

AnnaNZ · Posted 07-02-2017 07:31 PM

I have ca 50 datasets, that can be categorised into 3 dataset types I could say

the main joins are:

A: Refer_ID

B:Refer_ID Cl_ID

C: CL_ID Cas_ID

I am wanting to make sure every dataset has all 3 variables: Refer_ID, CL_ID , Cas_ID to make the processing later on easier. \

There are never the same number of Refer_ID, CL_ID or Cas_ID,

Because there are datasets where Refer_ID has more or less CL_ID and there are also different numbers of CAS_ID depending on the dataset.

I am ‘merging' datasets with INNER JOINS, to make sure that every dataset always has all three variables: Refer_ID CL_ID Cas_ID

But they don't end up with the same amount of datapoints at the end. Should they not logically all have then the same amount of datapoints?

How can I improve on my inner join statements? I am wondering, whether it should be "merged' based on two variables if possible and not only based on one (

on A.CL_ID = B.CL_ID ;)

?

Ideas are most welcome

*/ INNER JOIN - Kea.trails_file_7  */;  
Proc sql;  
Create table Kea.trail7A as   
Select A. * , B.REFER_ID 
from Kea.trails_file_7 A  inner join Kea.REFERENCE2 B
 on A.CL_ID = B.CL_ID ;    
quit;

In the example below:

trails_file_7 has CL_ID  and I use Reference2 to attach  Refer_ID, CL_ID , Cas_ID to trails7 - I want to make sure that trails 7 has all three variables.

AnnaNZ · Posted 07-02-2017 09:56 PM

yes

I started from scratch and joined all that have the REFER_ID and Cl_ID - to create REFERENCE 1

Then all with CL_ID and CAS_ID- to create REFERENCE 2

Join REFERENCE 1 & REFERENCE 2 to create a reference dataset for all REFERENCE_ALL with REFER_ID and Cl_IDand CAS_ID

then join every single dataset with this REFERENCE_ALL.

Maybe that is a better way

View solution in original post

Reeza · Posted 07-02-2017 08:23 PM

What uniquely identifies a record/person?

AnnaNZ · Posted 07-02-2017 08:34 PM

At the end it will be the REFER_ID, but not all datasets have the refer_id. Therefore I have to find ways to include it, and I thought that

datasets with only CL_ID could be merged with datasets that have CL_ID and REFER_ID

datasets with only CAS_ID could be merged with datasets that have CL_ID and CAS_ID and

datasets with CAS_ID and CL_ID could be merged with datasets that have CL_ID or CAS_ID in combination with REFER_ID

Therefore I have built a reference dataset with REFER_ID, CL_ID and CAS_ID

and this would then be used to give every dataset a column with REFER_ID. To build that reference dataset, I used the dataset that will build the spine. Therefore, this dataset should have all the REFER_IDs needed.

Is that line of thought correct? And how do I make sure my otherdatasets get all these REFER_ID attached to them ?

Many thanks

Reeza · Posted 07-02-2017 08:46 PM

@AnnaNZ wrote:

Is that line of thought correct? And how do I make sure my otherdatasets get all these REFER_ID attached to them ?

Many thanks

This is known as creating a master 'person' list and is commonly done in data integration projects. So yes, this is the correct approach and how I'd work with it. What you do is create custom formats that will map CAS_ID and CL_ID to REFER_ID in this. But make absolutely sure that it's a one to one relationship, it may not be. For example can a REFER_ID have more than one CAS_ID or CL_ID? Are REFER_ID's unique to a person?

AnnaNZ · Posted 07-02-2017 08:52 PM

- Are REFER_ID's unique to a person?

Yes they are unique to one person. But one person can have several Clients: CL_ID

and a CL_ID can have several Cases: CAS_ID

So what to do about that? I systematecally dropped all duplicates after generating a new dataset:

for example: 
 proc sort data=Kea.trail8A noduprecs; 
      by _all_ ; Run;

and then joined it with the next set. Is that a sensible way to do it? eventually I should have everywhere the same N ? Right?

Many thanks

Reeza · Posted 07-02-2017 09:50 PM

Should they all have the same number of records? I can't say to be honest. It depends more on your data than anything else, but in the end aren't you expecting a single master data set anyways, not multiples.

AnnaNZ · Posted 07-02-2017 09:56 PM

yes

I started from scratch and joined all that have the REFER_ID and Cl_ID - to create REFERENCE 1

Then all with CL_ID and CAS_ID- to create REFERENCE 2

Join REFERENCE 1 & REFERENCE 2 to create a reference dataset for all REFERENCE_ALL with REFER_ID and Cl_IDand CAS_ID

then join every single dataset with this REFERENCE_ALL.

Maybe that is a better way

Registration is open