BookmarkSubscribeRSS Feed
bshiferaw27
Fluorite | Level 6

Hi everyone, 

 

I wanted to merge two separate data sets using proc SQL and order them by the person_id. However, the person_id on both data sets is named in different ways. For example, dataA person_id is named "BL-ABL-002-Xp" while dataB person_id is named "002" and so on. I was wondering if anyone knows an easier way I would be able to rename the dataA multiple person_id variables to only use the three-digit numbers "002"? I hope my question makes sense.

 

So far the code I am using is:

proc SQL;

create table baseline_comb as

select *

from dataA, dataB

where dataA.person_id = dataB.person_id
order by dataA.person_id;

quit;

 

I appreciate the help in advance!

 

7 REPLIES 7
Reeza
Super User

Are they all formatted exactly as shown, with the - and spaces?

If so, you can use SCAN(). I would also recommend not using a cartesian join, but a full join and change your condition to ON rather than a WHERE. And use aliases for your tables. See an example below.

proc SQL;
create table baseline_comb as
select *
from dataA as A
 full join dataB as B
on scan(a.person_id, 3, "-") = b.person_id
order by a.person_id;

quit;

@bshiferaw27 wrote:

Hi everyone, 

 

I wanted to merge two separate data sets using proc SQL and order them by the person_id. However, the person_id on both data sets is named in different ways. For example, dataA person_id is named "BL-ABL-002-Xp" while dataB person_id is named "002" and so on. I was wondering if anyone knows an easier way I would be able to rename the dataA multiple person_id variables to only use the three-digit numbers "002"? I hope my question makes sense.

 

So far the code I am using is:

proc SQL;

create table baseline_comb as

select *

from dataA, dataB

where dataA.person_id = dataB.person_id
order by dataA.person_id;

quit;

 

I appreciate the help in advance!

 





PaigeMiller
Diamond | Level 26

Please clarify.

 

Is your question about renaming variables (which is what you wrote) or is your question about renaming/modifying the values of the variables?

--
Paige Miller
bshiferaw27
Fluorite | Level 6

Yes, my question is about renaming the variable name to match with dataB.

PaigeMiller
Diamond | Level 26

@bshiferaw27 wrote:

Yes, my question is about renaming the variable name to match with dataB.


In your original code, the variables seem to have the same name, both variables are named person_id, thus no renaming needed.

 

where dataA.person_id = dataB.person_id

 

--
Paige Miller
bshiferaw27
Fluorite | Level 6

Sorry, I guess I did not make it very clear. So the issue is, the way the person_id is named in dataA is different from dataB, therefore it is not letting me merge it by the person_id. What I am trying to do is to rename the entire person_ids included in dataA to having the three digits numbers to match with the person_ids in dataB. For example, how can I rename the entire person_id names in dataA from "BL_BLA_002_Xl" to "002" for all the name variables? 

 

I hope this helps clarify.  

Tom
Super User Tom
Super User

@bshiferaw27 wrote:

Sorry, I guess I did not make it very clear. So the issue is, the way the person_id is named in dataA is different from dataB, therefore it is not letting me merge it by the person_id. What I am trying to do is to rename the entire person_ids included in dataA to having the three digits numbers to match with the person_ids in dataB. For example, how can I rename the entire person_id names in dataA from "BL_BLA_002_Xl" to "002" for all the name variables? 

 

I hope this helps clarify.  


The way the id values are STORED (or CODED) in the two datasets is different. The name of the variable where it is stored does not change so no renaming of the variables is needed. Instead you want to transform or recode the values stored in the variables.

 

If you want to transform the longer values like "BL_BLA_002_Xl" to the shorter values like "002" it looks from your limited example like the SCAN() function is what you want.  So to find the next to the last word delimited by underscores you could use SCAN(PERSON_ID,-2,'_').   Or  if you want the third word use SCAN(PERSON_ID,3,'_');

 

If you are using PROC SQL to combine the datasets you can even do this on the fly.  

 

create table baseline_comb as
  select a.person_id as long_id
       , b.person_id as short_id
       , *
  from dataA a
  inner join dataB b
  on a.person_id = scan(b.person_id,-2,'_')
  order by long_id, short_id
;

 

Patrick
Opal | Level 21

@bshiferaw27 What really would help clarify things is if you'd post representative sample data for your two source tables (SAS datasteps creating the sample data) and then show us the desired result.

You say variables but I still believe you're actually talking about the values stored in variables. Once you provide representative sample data all this ambiguity will just go away and we can concentrate on solving your actual problem.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 949 views
  • 1 like
  • 5 in conversation