Dear SAS Experts,
I am currently working on merging two datasets, dataset1 and dataset1, using a common column x. However, I’m facing challenges due to inconsistencies in the format and length of the values in this column across the two datasets. Below are the different scenarios I’ve encountered:
The x column in dataset1 contains missing values, whereas the corresponding values are present in x column in dataset2.
The x column in dataset1 contains 3-digit values (some starting with zero), but the same values appear as single or double digits in x column in dataset2.
In some cases, x column in dataset1 has 3 or 5 digits, while in dataset2 the corresponding value appears as a shorter form. For these, I need to match based on the last 2 digits.
In other cases, x column in dataset1 has 3 digits, while in x column in dataset2 the corresponding value appears as a 5-digit code. For these, I need to match based on the last 3 digits.
When x column has 5 digits in both datasets, the records should match exactly.
If x column in dataset1 has 5 digits and in x column in dataset2 has 14 digits, I need to match based on the last 5 digits.
Could you please suggest an efficient and reliable approach to standardize or transform these values and perform a successful merge that accommodates these scenarios?
Thank you in advance for your support.
Sure,
I have added more information to understand-
sas code code 2 datasets-
/* Create dataset1 */
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
;
run;
/* Create dataset2 */
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
;
run;
I have added more information to understand-
sas code code 2 datasets-
/* Create dataset1 */
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
;
run;
/* Create dataset2 */
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
;
run;
I am not sure how realistic this example is. In order to be able to compare lengths of variables across data sets, you need to be able to combine them first. I have done that by creating an ID variable and used that on a BY statement in a Merge. However, this assumes that the data sets have the same number of observations and that the X values you are comparing are on the same observation. Is this the case in the actual data? Here is a starting point that combines the data, renaming the X variables and assigning the length. If this is the case with the real data, then you can add IF logic to compare lengths and then reassign values.
/* Create dataset1 */
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
;
run;
/* Create dataset2 */
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
;
run;
data dataset1;
set dataset1;
id=_n_;
rename x=x1;
run;
data dataset2;
set dataset2;
id=_n_;
rename x=x2;
run;
data combine;
merge dataset1 dataset2;
by id;
len1=length(x1);
len2=length(x2);
run;
proc print data=combine;
run;
Can X in dataset 2 ever have more than five digits? If so, the scenario 6 will be ambiguous.
Dataset 1 has 10101010198765
Your dataset 2 has 98765. But it might also have 198765. If so, then which obs from dataset 2 is chosen?
You say that dataset1 can have:
missing data
3 digits
3 or 5 digits
5 digits
Yet the fourth observation has X=GA4. one digit.
Please be complete in describing the data, and clarifying the matching rules.
Help us help you.
Do the values with large numbers of digits have meaningful leading zeros?
That is can we just assume for the second dataset (the one that appears to slightly more standardized) that if we convert the digits to a number we can divide those number into 4 categories:
missing
less 100
less than 100000
others
Or better still just convert the digits to a number (truncate to the last 5 digits using MOD()) and match on the number.
data one;
set dataset1 ;
digits1=compress(x,,'kd');
ndigits1=lengthn(digits1);
number=mod(input(digits1,??32.),1E5);
rename x=x1;
run;
data two;
set dataset2 ;
digits2=compress(x,,'kd');
ndigits2=lengthn(digits2);
number=mod(input(digits2,??32.),1E5);
rename x=x2;
run;
proc sql;
create table check as
select *
from one natural full join two
;
quit;
proc print data=check;
id number;
var x1 x2 product name digits1 digits2 ndigits1 ndigits2 ;
run;
this is what I cam up with till now - but it still not giving accurate result
/* Initial Data Setup */
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
via9 22201
via10 6631
;
run;
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 201
i 98765
j 31
;
run;
/* Initial Cleaning Steps for dataset1 */
data d1clean;
set dataset1;
length x1 $15;
x1 = compress(x, '', 'kd');
if not missing(x1) then do;
if input(x1, best32.) = 0 then do;
x1 = '0';
end;
else do;
x1 = cats(input(x1, best32.));
end;
end;
else do;
x1 = ''; /* Ensure blank if no digits */
end;
run;
/* Initial Cleaning Steps for dataset2 */
data d2clean;
set dataset2;
length x1 $15;
x1 = compress(x, '', 'kd');
if not missing(x1) then do;
if input(x1, best32.) = 0 then do;
x1 = '0';
end;
else do;
x1 = cats(input(x1, best32.));
end;
end;
else do;
x1 = ''; /* Ensure blank if no digits */
end;
run;
PROC SQL;
/* Create the final transformed version of d1clean */
CREATE TABLE d1_final_cleaned AS
SELECT
d1.Product,
d1.x AS original_dataset1_x,
d1.x1 AS x1_after_initial_cleaning,
COALESCE(
( /* Subquery to find the shortest d2.x1 that is a suffix of d1.x1 (if d1.x1 is longer) */
SELECT MIN(d2.x1)
FROM d2clean d2
WHERE
LENGTH(TRIM(d2.x1)) > 0 AND
LENGTH(TRIM(d1.x1)) > LENGTH(TRIM(d2.x1)) AND
/* Check if d1.x1 ends with d2.x1 */
INDEX(TRIM(d1.x1), TRIM(d2.x1)) = (LENGTH(TRIM(d1.x1)) - LENGTH(TRIM(d2.x1)) + 1) AND
LENGTH(TRIM(d2.x1)) = (
SELECT MIN(LENGTH(TRIM(d2_inner.x1)))
FROM d2clean d2_inner
WHERE
LENGTH(TRIM(d2_inner.x1)) > 0 AND
LENGTH(TRIM(d1.x1)) > LENGTH(TRIM(d2_inner.x1)) AND
INDEX(TRIM(d1.x1), TRIM(d2_inner.x1)) = (LENGTH(TRIM(d1.x1)) - LENGTH(TRIM(d2_inner.x1)) + 1)
)
),
d1.x1
) AS x1
FROM d1clean d1;
/* Create the final transformed version of d2clean */
CREATE TABLE d2_final_cleaned AS
SELECT
d2.Name,
d2.x AS original_dataset2_x,
d2.x1 AS x1_after_initial_cleaning,
COALESCE(
( /* Subquery to find the shortest d1.x1 that is a suffix of d2.x1 (if d2.x1 is longer) */
SELECT MIN(d1.x1)
FROM d1clean d1
WHERE
LENGTH(TRIM(d1.x1)) > 0 AND
LENGTH(TRIM(d2.x1)) > LENGTH(TRIM(d1.x1)) AND
INDEX(TRIM(d2.x1), TRIM(d1.x1)) = (LENGTH(TRIM(d2.x1)) - LENGTH(TRIM(d1.x1)) + 1) AND
LENGTH(TRIM(d1.x1)) = (
SELECT MIN(LENGTH(TRIM(d1_inner.x1)))
FROM d1clean d1_inner
WHERE
LENGTH(TRIM(d1_inner.x1)) > 0 AND
LENGTH(TRIM(d2.x1)) > LENGTH(TRIM(d1_inner.x1)) AND
INDEX(TRIM(d2.x1), TRIM(d1_inner.x1)) = (LENGTH(TRIM(d2.x1)) - LENGTH(TRIM(d1_inner.x1)) + 1)
)
),
d2.x1
) AS x1
FROM d2clean d2;
QUIT;
/* Turn options back on if you need them for subsequent steps */
/* OPTIONS NOTES STIMER SOURCE SYNTAXCHECK; */
/* Display the results */
PROC PRINT DATA=d1_final_cleaned NOOBS;
TITLE "d1_final_cleaned: x1 transformed based on d2clean suffixes";
RUN;
PROC PRINT DATA=d2_final_cleaned NOOBS;
TITLE "d2_final_cleaned: x1 transformed based on d1clean suffixes";
RUN;
So How do I know the missing value of dataset1 match 2 of dataset1 ,not 3 or 4 of dataset2 ?
data dataset1;
input Product $ x $15.;
datalines;
via1 .
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
;
run;
/* Create dataset2 */
data dataset2;
input Name $ x $15.;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
;
run;
proc sql;
create table temp as
select a.Product,a.x as a_x,b.name,b.x as b_x
from dataset1 as a,dataset2 as b
where strip(reverse(a.x)) eqt strip(reverse(b.x))
group by a.x
having length(b.x)=max(length(b.x))
order by b.name;
create table want as
select * from temp
union corr
/*Match missing value of dataset1 to 2 of dataset1*/
select *
from (select product as Product,x as a_x from dataset1 where x is missing),
(select name as name,x as b_x from dataset2 where b_x not in (select b_x from temp))
;
quit;
This can be done in a single data step that first reads DATASET1, stores it in a hash object (lookup table) with a lookup key created by modifying X (remove leading non-digits, shorten it to the last 5 digits if necessary, and remove leading zeroes. This is followed by reading DATASET2, where X is similarly modified, and a lookup is performed to see whether it is in the hash object, from which the PRODUCT value is retrieved:
data dataset1 (label='x with 3 or 5 digits');
infile datalines missover;
input product $4. x :$20. ;
datalines;
via1
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
run;
data dataset2;
infile datalines ;
input name $1. x :$20. ;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
run;
data want (drop=rc);;
set dataset1 (in=in1) dataset2 (in=in2);
where x^='';
if _n_=1 then do;
declare hash d1 ();
d1.definekey('x');
d1.definedata('product');
d1.definedone();
end;
x=substr(x,anydigit(x)); /*Remove leading non-digits*/
if length(x)>5 then x=substr(x,length(x)-4); /*If too long, take last 5 digits*/
do while (x=:'0'); /* Strip leading zeroes*/
x=substr(x,2);
end;
if in1=1 then d1.add();
if in2;
rc=d1.find();
run;
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.