BookmarkSubscribeRSS Feed
Mscarboncopy
Pyrite | Level 9

Hello.

I have the following 2 data sets with different variables that I need to merge into one. My problem is that the second file ID is different from the first.

The common var is only VAR1. I am not giving examples for what is in the other variables in each set as I do not think they will give me an issue. Just listing the names here for an idea of what my outcome would look like.

Is there a way to reassign a record ID based on a common var like this? I would imagine creating a new ID var - dropping var  record_ID and then renaming the new ID ID like in SET1.

I can merge the file by VAR1 but it is not the output I am looking for, as I get 2 records per ID.

 

SET 1

input ID Var1 P1 $ P2 P3 P4 P5;

Datalines;

1    3456

2   8796

3   2222

4  7689

;

 

SET 2

Input record_ID VAR1  B1 B2 B3 $ B4 B5;

Datalines;

10                3456

12                8796

15                2222

20               7689

;

What I need is to make record_id 10 12 15 20 become 1 2 3 4, based on that common VAR1 and have all variables in one record. Like this:

 

ID VAR1   P1 P2 P3 P4 P5 B1 B2 B3 B4 B5

1     3456

2    8796

3    2222

4    7689

 

Thank you!

19 REPLIES 19
himself
Pyrite | Level 9

Hi you can achieve this in SAS by merging the datasets based on the common variable VAR1 and then reassigning the record_ID to match the ID in SET1. Here's how you can do it:

 

/* Step 1: Create SET1 */
data SET1;
    input ID Var1 P1 $ P2 P3 P4 P5;
    datalines;
1 3456
2 8796
3 2222
4 7689
;
run;

/* Step 2: Create SET2 */
data SET2;
    input record_ID VAR1 B1 B2 B3 $ B4 B5;
    datalines;
10 3456
12 8796
15 2222
20 7689
;
run;

/* Step 3: Merge datasets based on VAR1 */
proc sql;
    create table merged as
    select a.ID, a.Var1, a.P1, a.P2, a.P3, a.P4, a.P5, 
           b.B1, b.B2, b.B3, b.B4, b.B5
    from SET1 as a
    inner join SET2 as b
    on a.Var1 = b.Var1;
quit;

/* Step 4: Display the merged dataset */
proc print data=merged;
run;

Using PROC SQL to merge datasets in SAS has several advantages:

  • Flexibility: PROC SQL allows for complex joins and merges, including inner joins, outer joins, and conditional joins, which can be more flexible than traditional DATA step merges.
  • Simplicity: The syntax for merging datasets using PROC SQL can be simpler and more intuitive, especially for those familiar with SQL.

References

Match Merging Data Files Using PROC SQL 

 

 

Tom
Super User Tom
Super User

Assuming that VAR1 uniquely defines the observations in at least one of the two dataset just do a MERGE.

proc sort data=set1;
  by var1;
run;
proc sort data=set2;
  by var1;
run;
data want;
   merge set1 set2;
   by var1;
run;
Mscarboncopy
Pyrite | Level 9

Thank you @Tom . When I do that I have 2 records per ID in my merged file, as the variables are not all the same aside from Var1. I need a file with only one record per ID.

I tried what Ksharp suggested and it worked, pending troubleshooting why the values from my set 1 are showing up as missing.

Tom
Super User Tom
Super User

@Mscarboncopy wrote:

Thank you @Tom . When I do that I have 2 records per ID in my merged file, as the variables are not all the same aside from Var1. I need a file with only one record per ID.

I tried what Ksharp suggested and it worked, pending troubleshooting why the values from my set 1 are showing up as missing.


Why do you have 2 records per ID?  Did you start with multiple records per ID (or as it was called in the other dataset RECORD_ID?) to begin with?  Did you have multiple records per VAR1 in either of the datasets?

What do you want to do to reduce 2 records into one?

 

I do not see duplicates or missing values with your example set of ID/VAR1 values.

Spoiler
data set1 ;
  input ID Var1 P1 $ P2 P3 P4 P5;
datalines;
1 3456 A 2 3 4 5
2 8796 B 1 2 2 4
3 2222 C 5 6 7 8
4 7689 D 9 1 2 3
;

 
data set2;
  input record_ID VAR1 B1 B2 B3 $ B4 B5;
datalines;
10 3456 1 2 A 3 4
12 8796 5 6 B 7 8
15 2222 9 1 C 2 3
20 7689 4 5 D 6 7
;

proc sort data=set1; by var1; run;
proc sort data=set2; by var1; run;

data want;
  length new_id 8 ;
  merge set1 set2 ;
  by var1;
  new_id = coalesce(record_id,id);
run;

Tom_0-1741870949593.png

 

 

 

Mscarboncopy
Pyrite | Level 9

It is so odd @Tom. I tried your code and this is what I get. We have data being collected separately and this is why the same participants are getting different IDs. I am trying to merge the files and keep one ID (which is in my example var ID) and one record per participant.  When I do it the way Ksharp suggested I get them all in one record but I am missing all the data from Set1

This is your suggestion  I will post next what I am getting with Ksharp's idea. Thank you for your patience.

OBSnew_idIDVar1P1P2P3P4P5Record_idB1B2B3B4B5
115 2222     1591C23
210 3456     1012A34
320 7689     2045D67
412 8796     1256B78
5332222C5678      
6113456A2345      
7447689D9123      
8228796B1224     

 

 

Mscarboncopy
Pyrite | Level 9

@Tom @Ksharp What I am getting with Ksharp's code. It is what I need but I need the values to be there as well as the ID.

OBSIDVar1P1P2P3P4P5B1B2B3B4B5
1 2222     91C23
2 3456     12A34
3 7689     45D67
4 8796     56B78
Tom
Super User Tom
Super User

To get that output then values of VAR1 that LOOK the same like 2222 must not actually BE the same.

 

So if VAR1 is CHARACTER then one of the two values has leading spaces.  SAS will ignore trailing spaces but leading spaces are not ignored.  

'   2222' ne '2222' 

NOTE: If you print the values using ODS then the leading spaces are removed by ODS so you cannot see them in the report.  You can attach the $QUOTE format to the variable. Then leading spaces will be apparent in ODS output.

 

You can fix this by applying the LEFT() function to move the leading spaces to trailing spaces.

 

If VAR1 is NUMERIC then there is some small difference in the values.  If the values should be integers then you can apply the ROUND() function so the values match what was printed.  If the integers use more than 16 digits then you have values that are too large to be stored uniquely as numbers in SAS.  Convert them to character variables instead.

 

I can reproduce your output by adding leading spaces to VAR1 in SET1.

data set1 ;
  input ID Var1 $ P1 $ P2 P3 P4 P5;
  var1=' '||var1;
datalines;
1 3456 A 2 3 4 5
2 8796 B 1 2 2 4
3 2222 C 5 6 7 8
4 7689 D 9 1 2 3
;


data set2;
  input record_ID VAR1 $ B1 B2 B3 $ B4 B5;
datalines;
10 3456 1 2 A 3 4
12 8796 5 6 B 7 8
15 2222 9 1 C 2 3
20 7689 4 5 D 6 7
;

proc sort data=set1; by var1; run;
proc sort data=set2; by var1; run;

data want;
  length new_id 8 ;
  merge set1 set2 ;
  by var1;
  new_id = coalesce(record_id,id);
run;

proc print;
run;
Mscarboncopy
Pyrite | Level 9

@TomThank you again. I did not think of this at all. Var1 is actually a string so that was exactly the issue. I transformed it in Numeric and that seems to have solved the issue. I am assuming transforming into numeric is the solution for any trailing? Am I right? As participants can easily add a leading space here and there.

I like it that there are different ways to reach the same result. I have to investigate a little more as with your code I am getting an extra ID for some reason, as opposed to when I use the code from KSharp. But I am pulling from my data not the example I gave. Thank you for your guidance. It is really appreciated. 

Tom
Super User Tom
Super User

Converting to numbers will work as long has you do not have one of these conditions:

  • Significant leading zeros. ID=0001 is not the same entity as ID=1.
  • Values that are larger than the SAS can store uniquely as a number.
1    data _null_;
2      maxint = constant('exactint');
3      put maxint=comma23.;
4    run;

maxint=9,007,199,254,740,992
Mscarboncopy
Pyrite | Level 9

@Tom @Ksharp Another question. I added different numbers on VAR1  and noticed that this one gave me an issue: 657899654.

 

This is curious:

1. Using Tom's code it applied the correct ID but it created another record with a round up number of 700000000 with a blank ID. The record that get the ID does not have the data that would be coming from set 2. The extra record does.

2. If I use Ksharp's code, no extra record is created but that specific record does not get the ID, ID is blank and the merging variable number VAR1 is transformed to 700000000. All other variables are in, as they should. 

 

It looks like numbers up to 7 digits work fine. I did not try 8 digits, is it safe to say that up to 8 digits it would work fine? Is the issue here that the number is 9 digit?

Why do the codes not work for that specific Var1 # ?

I can make sure the VAR1 number is not >7 or 8 digit, if that is the issue.

I wanted to understand what is happening before moving forward.

Why would SAS round up that particular Var 1 value (on both codes) and split the data on 1 and not assign the ID on 2? It did not do the same for any of the other values I entered (up to 7 digits).

Thanks.

 

 

Tom
Super User Tom
Super User

Need to see what code you actually used.  Is VAR1 numeric or character?  How did you create the datasets?  Did you run a data step to read in text lines?  From a file? From in-line cards/datalines?  Did you read in some existing dataset?  What was the source?

 

I cannot think on any operations that would round 657,899,654 to 700,000,000 other than rounding to 100 millions place.

var1=round(var1,100000000);
Mscarboncopy
Pyrite | Level 9

I transformed VAR1 that is being collected as character into numeric to solve my issue with leading spaces for the code you suggested I use for the merge/re-coding of the var ID.

This is fake data I am entering, the data is coming from a REDCap form that is being exported as a SAS file. I use the SAS files to run through my code (The two files I mentioned before set 1 and set 2). I did not enter the values in datalines. 

In the code you suggested I use, you mentioned something about a possible issue coming from a number that is too long?

Could that have been it?

Tom
Super User Tom
Super User

@Mscarboncopy wrote:

I transformed VAR1 that is being collected as character into numeric to solve my issue with leading spaces for the code you suggested I use for the merge/re-coding of the var ID.

This is fake data I am entering, the data is coming from a REDCap form that is being exported as a SAS file. I use the SAS files to run through my code (The two files I mentioned before set 1 and set 2). I did not enter the values in datalines. 

In the code you suggested I use, you mentioned something about a possible issue coming from a number that is too long?

Could that have been it?


You should not see issues with dealing with integer values of that length with SAS.  Not sure a about what REDCAP is doing internally or how they attempt to convert what they have internally into SAS datasets.  Perhaps the field length in REDCAP is limited in some way causing some truncation?  Perhaps REDCAP's idea of how to convert what they have into SAS datasets is flawed?

 

Look at the values for the two observations that you think should match before your attempt to convert them on the SAS side into numbers.  To really see what character variables have use the $HEX format.  That will display each byte as two hexadecimal digits.  A space will be show as hex code 20.  The digits will be hexcodes 30 to 39.  Any other hexcodes will cause trouble.

 

You can get some "rounding" if you where to write out that number using SAS's BEST format without enough characters to represent all of the value.  To get only one digit you would have to use BEST3. format, which would result in displaying 7E8 which when converted back into a number would become 7 with 8 zeros.

22   data test;
23     string='657,899,654';
24     number1=input(string,comma32.);
25     string2=put(number1,best3.);
26     number2=input(string2,32.);
27     put (_all_) (=/);
28   run;


string=657,899,654
number1=657899654
string2=7E8
number2=700000000

 

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 19 replies
  • 2553 views
  • 4 likes
  • 4 in conversation