I have ca 50 datasets, that can be categorised into 3 dataset types I could say
the main joins are:
A: Refer_ID
B:Refer_ID Cl_ID
C: CL_ID Cas_ID
I am wanting to make sure every dataset has all 3 variables: Refer_ID, CL_ID , Cas_ID to make the processing later on easier. \
There are never the same number of Refer_ID, CL_ID or Cas_ID,
Because there are datasets where Refer_ID has more or less CL_ID and there are also different numbers of CAS_ID depending on the dataset.
I am ‘merging' datasets with INNER JOINS, to make sure that every dataset always has all three variables: Refer_ID CL_ID Cas_ID
But they don't end up with the same amount of datapoints at the end. Should they not logically all have then the same amount of datapoints?
How can I improve on my inner join statements? I am wondering, whether it should be "merged' based on two variables if possible and not only based on one (
on A.CL_ID = B.CL_ID ;)
?
Ideas are most welcome
*/ INNER JOIN - Kea.trails_file_7 */; Proc sql; Create table Kea.trail7A as Select A. * , B.REFER_ID from Kea.trails_file_7 A inner join Kea.REFERENCE2 B on A.CL_ID = B.CL_ID ; quit;
In the example below:
trails_file_7 has CL_ID and I use Reference2 to attach Refer_ID, CL_ID , Cas_ID to trails7 - I want to make sure that trails 7 has all three variables.
yes
I started from scratch and joined all that have the REFER_ID and Cl_ID - to create REFERENCE 1
Then all with CL_ID and CAS_ID- to create REFERENCE 2
Join REFERENCE 1 & REFERENCE 2 to create a reference dataset for all REFERENCE_ALL with REFER_ID and Cl_IDand CAS_ID
then join every single dataset with this REFERENCE_ALL.
Maybe that is a better way
What uniquely identifies a record/person?
At the end it will be the REFER_ID, but not all datasets have the refer_id. Therefore I have to find ways to include it, and I thought that
datasets with only CL_ID could be merged with datasets that have CL_ID and REFER_ID
datasets with only CAS_ID could be merged with datasets that have CL_ID and CAS_ID and
datasets with CAS_ID and CL_ID could be merged with datasets that have CL_ID or CAS_ID in combination with REFER_ID
Therefore I have built a reference dataset with REFER_ID, CL_ID and CAS_ID
and this would then be used to give every dataset a column with REFER_ID. To build that reference dataset, I used the dataset that will build the spine. Therefore, this dataset should have all the REFER_IDs needed.
Is that line of thought correct? And how do I make sure my otherdatasets get all these REFER_ID attached to them ?
Many thanks
@AnnaNZ wrote:
Is that line of thought correct? And how do I make sure my otherdatasets get all these REFER_ID attached to them ?
Many thanks
This is known as creating a master 'person' list and is commonly done in data integration projects. So yes, this is the correct approach and how I'd work with it. What you do is create custom formats that will map CAS_ID and CL_ID to REFER_ID in this. But make absolutely sure that it's a one to one relationship, it may not be. For example can a REFER_ID have more than one CAS_ID or CL_ID? Are REFER_ID's unique to a person?
- Are REFER_ID's unique to a person?
Yes they are unique to one person. But one person can have several Clients: CL_ID
and a CL_ID can have several Cases: CAS_ID
So what to do about that? I systematecally dropped all duplicates after generating a new dataset:
for example: proc sort data=Kea.trail8A noduprecs; by _all_ ; Run;
and then joined it with the next set. Is that a sensible way to do it? eventually I should have everywhere the same N ? Right?
Many thanks
Should they all have the same number of records? I can't say to be honest. It depends more on your data than anything else, but in the end aren't you expecting a single master data set anyways, not multiples.
yes
I started from scratch and joined all that have the REFER_ID and Cl_ID - to create REFERENCE 1
Then all with CL_ID and CAS_ID- to create REFERENCE 2
Join REFERENCE 1 & REFERENCE 2 to create a reference dataset for all REFERENCE_ALL with REFER_ID and Cl_IDand CAS_ID
then join every single dataset with this REFERENCE_ALL.
Maybe that is a better way
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.