BookmarkSubscribeRSS Feed
kiag
Fluorite | Level 6

Dear SAS Experts,

I am currently working on merging two datasets, dataset1 and dataset1, using a common column x. However, I’m facing challenges due to inconsistencies in the format and length of the values in this column across the two datasets. Below are the different scenarios I’ve encountered:

  1. The x column in dataset1 contains missing values, whereas the corresponding values are present in x column in dataset2.

  2. The x column in dataset1 contains 3-digit values (some starting with zero), but the same values appear as single or double digits in x column in dataset2.

  3. In some cases, x column in dataset1 has 3 or 5 digits, while in dataset2 the corresponding value appears as a shorter form. For these, I need to match based on the last 2 digits.

  4. In other cases, x column in dataset1 has 3 digits, while in x column in dataset2 the corresponding value appears as a 5-digit code. For these, I need to match based on the  last 3 digits.

  5. When x column has 5 digits in both datasets, the records should match exactly.

  6. If x column in dataset1 has 5 digits and in x column in dataset2 has 14 digits, I need to match based on the last 5 digits.

Could you please suggest an efficient and reliable approach to standardize or transform these values and perform a successful merge that accommodates these scenarios?

Thank you in advance for your support.

13 REPLIES 13
Ksharp
Super User
Better post some real data to demonstrate these 6 scenarios you mentioned ,and post the desired output .
So other sas users could understand your requirement better and faster.
kiag
Fluorite | Level 6

Sure, 

I have added more information to understand-

kiag_0-1747882917739.png

 

 

sas code code 2 datasets- 

/* Create dataset1 */
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
;
run;

/* Create dataset2 */
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
;
run;

kiag
Fluorite | Level 6
Sure, I have added more information in the comments to understand-
kiag
Fluorite | Level 6

I have added more information to understand-

kiag_0-1747882601106.png

sas code code 2 datasets- 

/* Create dataset1 */
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
;
run;

/* Create dataset2 */
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
;
run;

 

Kathryn_SAS
SAS Employee

I am not sure how realistic this example is. In order to be able to compare lengths of variables across data sets, you need to be able to combine them first. I have done that by creating an ID variable and used that on a BY statement in a Merge. However, this assumes that the data sets have the same number of observations and that the X values you are comparing are on the same observation. Is this the case in the actual data? Here is a starting point that combines the data, renaming the X variables and assigning the length. If this is the case with the real data, then you can add IF logic to compare lengths and then reassign values.

/* Create dataset1 */
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
;
run;

/* Create dataset2 */
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
;
run;

data dataset1;
set dataset1;
id=_n_;
rename x=x1;
run;

data dataset2;
set dataset2;
id=_n_;
rename x=x2;
run;

data combine;
merge dataset1 dataset2;
by id;
len1=length(x1);
len2=length(x2);
run;

proc print data=combine;
run;
kiag
Fluorite | Level 6
No, the data sets do not have the same number of observations and that the X values I am comparing are not on the same observation. In fact there are multiple numbers(observations) repeated as well within the same x column of the.
mkeintz
PROC Star

Can X in dataset 2 ever have more than five digits?  If so, the scenario 6 will be ambiguous.

 

Dataset 1 has 10101010198765

 

Your dataset 2 has 98765.  But it might also have 198765.  If so, then which obs from dataset 2 is chosen?

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
kiag
Fluorite | Level 6
no, the x of the dataset contain the below formats -
missing data
3 digits
3 or 5 digits
5 digits

while the dataset2 has -
single /double digit
5 digits
14 digits
mkeintz
PROC Star

You say that dataset1 can have:

missing data
3 digits
3 or 5 digits
5 digits

Yet the fourth observation has X=GA4.  one digit.

 

Please be complete in describing the data, and clarifying the matching rules.

 

Help us help you.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

Do the values with large numbers of digits have meaningful leading zeros?

 

That is can we just assume for the second dataset (the one that appears to slightly more standardized) that if we convert the digits to a number we can divide those number into 4 categories:

 

missing

less 100

less than 100000

others

 

Or better still just convert the digits to a number (truncate to the last 5 digits using MOD()) and match on the number.

data one;
  set dataset1 ;
  digits1=compress(x,,'kd');
  ndigits1=lengthn(digits1);
  number=mod(input(digits1,??32.),1E5);
  rename x=x1;
run;

data two;
  set dataset2 ;
  digits2=compress(x,,'kd');
  ndigits2=lengthn(digits2);
  number=mod(input(digits2,??32.),1E5);
  rename x=x2;
run;

proc sql;
create table check as 
select * 
from one natural full join two
;
quit;
proc print data=check;
  id number;
  var x1 x2 product name digits1 digits2 ndigits1 ndigits2 ;
run;

Tom_0-1748051955928.png

 

kiag
Fluorite | Level 6

this is what I cam up with till now - but it still not giving accurate result

/* Initial Data Setup */
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
via9 22201
via10 6631
;
run;

data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 201
i 98765
j 31
;
run;

/* Initial Cleaning Steps for dataset1 */
data d1clean;
set dataset1;
length x1 $15;
x1 = compress(x, '', 'kd');
if not missing(x1) then do;
if input(x1, best32.) = 0 then do;
x1 = '0';
end;
else do;
x1 = cats(input(x1, best32.));
end;
end;
else do;
x1 = ''; /* Ensure blank if no digits */
end;
run;

/* Initial Cleaning Steps for dataset2 */

data d2clean;
set dataset2;
length x1 $15;
x1 = compress(x, '', 'kd');
if not missing(x1) then do;
if input(x1, best32.) = 0 then do;
x1 = '0';
end;
else do;
x1 = cats(input(x1, best32.));
end;
end;
else do;
x1 = ''; /* Ensure blank if no digits */
end;
run;


PROC SQL;
/* Create the final transformed version of d1clean */
CREATE TABLE d1_final_cleaned AS
SELECT
d1.Product,
d1.x AS original_dataset1_x,
d1.x1 AS x1_after_initial_cleaning,
COALESCE(
( /* Subquery to find the shortest d2.x1 that is a suffix of d1.x1 (if d1.x1 is longer) */
SELECT MIN(d2.x1)
FROM d2clean d2
WHERE
LENGTH(TRIM(d2.x1)) > 0 AND
LENGTH(TRIM(d1.x1)) > LENGTH(TRIM(d2.x1)) AND
/* Check if d1.x1 ends with d2.x1 */
INDEX(TRIM(d1.x1), TRIM(d2.x1)) = (LENGTH(TRIM(d1.x1)) - LENGTH(TRIM(d2.x1)) + 1) AND
LENGTH(TRIM(d2.x1)) = (
SELECT MIN(LENGTH(TRIM(d2_inner.x1)))
FROM d2clean d2_inner
WHERE
LENGTH(TRIM(d2_inner.x1)) > 0 AND
LENGTH(TRIM(d1.x1)) > LENGTH(TRIM(d2_inner.x1)) AND
INDEX(TRIM(d1.x1), TRIM(d2_inner.x1)) = (LENGTH(TRIM(d1.x1)) - LENGTH(TRIM(d2_inner.x1)) + 1)
)
),
d1.x1
) AS x1
FROM d1clean d1;

/* Create the final transformed version of d2clean */
CREATE TABLE d2_final_cleaned AS
SELECT
d2.Name,
d2.x AS original_dataset2_x,
d2.x1 AS x1_after_initial_cleaning,
COALESCE(
( /* Subquery to find the shortest d1.x1 that is a suffix of d2.x1 (if d2.x1 is longer) */
SELECT MIN(d1.x1)
FROM d1clean d1
WHERE
LENGTH(TRIM(d1.x1)) > 0 AND
LENGTH(TRIM(d2.x1)) > LENGTH(TRIM(d1.x1)) AND
INDEX(TRIM(d2.x1), TRIM(d1.x1)) = (LENGTH(TRIM(d2.x1)) - LENGTH(TRIM(d1.x1)) + 1) AND
LENGTH(TRIM(d1.x1)) = (
SELECT MIN(LENGTH(TRIM(d1_inner.x1)))
FROM d1clean d1_inner
WHERE
LENGTH(TRIM(d1_inner.x1)) > 0 AND
LENGTH(TRIM(d2.x1)) > LENGTH(TRIM(d1_inner.x1)) AND
INDEX(TRIM(d2.x1), TRIM(d1_inner.x1)) = (LENGTH(TRIM(d2.x1)) - LENGTH(TRIM(d1_inner.x1)) + 1)
)
),
d2.x1
) AS x1
FROM d2clean d2;
QUIT;

/* Turn options back on if you need them for subsequent steps */
/* OPTIONS NOTES STIMER SOURCE SYNTAXCHECK; */

/* Display the results */
PROC PRINT DATA=d1_final_cleaned NOOBS;
TITLE "d1_final_cleaned: x1 transformed based on d2clean suffixes";
RUN;

PROC PRINT DATA=d2_final_cleaned NOOBS;
TITLE "d2_final_cleaned: x1 transformed based on d1clean suffixes";
RUN;

Ksharp
Super User

So How do I know the missing value of dataset1 match 2 of dataset1 ,not 3 or 4 of dataset2 ?

data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
;
run;

/* Create dataset2 */
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
;
run;

proc sql;
create table temp as
select a.Product,a.x as a_x,b.name,b.x as b_x
 from dataset1 as a,dataset2 as b
  where strip(reverse(a.x)) eqt strip(reverse(b.x)) 
   group by a.x
    having length(b.x)=max(length(b.x))  
     order by b.name;

create table want as
select * from temp
union corr
/*Match missing value of dataset1 to 2 of dataset1*/
select *
 from (select product as Product,x as a_x from dataset1 where x is missing),
      (select name as name,x as b_x from dataset2 where b_x not in (select b_x from temp))
;
quit;

Ksharp_0-1748052388802.png

 

mkeintz
PROC Star

This can be done in a single data step that first reads DATASET1, stores it in a hash object (lookup table) with a lookup key created by modifying X (remove leading non-digits, shorten it to the last 5 digits if necessary, and remove leading zeroes.  This is followed by reading DATASET2, where X is similarly modified, and a lookup is performed to see whether it is in the hash object, from which the PRODUCT value is retrieved:

 

data dataset1 (label='x with 3 or 5 digits');
  infile datalines missover; 
  input product $4.  x :$20. ;
datalines;
via1 
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
run;

data dataset2;
  infile datalines ;
  input name $1.  x :$20. ;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
run;

data want (drop=rc);;
  set   dataset1 (in=in1) dataset2 (in=in2);
  where x^='';
  if _n_=1 then do;
    declare hash d1 ();
      d1.definekey('x');
      d1.definedata('product');
      d1.definedone();
  end;

  x=substr(x,anydigit(x)); /*Remove leading non-digits*/
  if length(x)>5 then x=substr(x,length(x)-4);  /*If too long, take last 5 digits*/
 
  do while (x=:'0'); /* Strip leading zeroes*/
    x=substr(x,2);
  end;
  if in1=1 then d1.add();
  if in2;
  rc=d1.find();
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 1186 views
  • 2 likes
  • 5 in conversation