topic Re: An efficient method to merge datasets with multiple length for the character variables in SAS Programming

An efficient method to concatenate/stack datasets with different lengths for the character variables

Emma_at_SAS — Thu, 16 Sep 2021 12:49:06 GMT

I want to merge multiple datasets with multiple variables, but the lengths of the character variables differ between datasets. I created an example that works, but I want to know how to improve it so it is easier to apply when there are 5-10 datasets with 20 variables to check for length.

Thanks for your thoughts

data d1;
	length var1 $10. var2 $10.;
	input var1 var2;
	cards;
apple carrots
orange cabbage
;

data d2;
	length var1 $30. var2 $30.;
	input var1 var2;
	datalines;
apple-pie-recipe  carrot-soup
orange-juice-for-breakfast cabbage-soup-for-supper
;

data d1_i (drop= var1_old var2_old);
	set d1(rename= (var1=var1_old var2=var2_old));
	length var1 $30.;
	var1=var1_old;
	length var2 $30.;
	var2=var2_old;
	run;

data want;
	set d1_i d2;
	run;

Re: An efficient method to merge datasets with multiple length for the character variables

Tom — Thu, 16 Sep 2021 04:30:25 GMT

Find the maximum lengths and just add a length statement before the SET.

So something like:

data want;
  length var1 $30 var2 $30 ;
  set d1 d2;
run;

You can generate that by querying the metadata for the source tables.

proc sql noprint;
select distinct catx(' ',name,cats('$',length))
  into :lengths separated by ' '
from
  (select upcase(name) as unique_name,min(name) as name,max(length) as length
  from dictionary.columns
  where libname='WORK' 
    and memname in ('D1' 'D2')
    and type='char'
  group by unique_name
  )
;
%let nchar=&sqlobs;
quit;

And then you can add that LENGTH statement to your step that combines the dataset.

You might also want to remove any formats that are attached to prevent things like a variable with length 200 using a format of only $30. which will cause the values to appear truncated even when they are not.

data want;
%if &nchar %then %do;
  length &lengths;
  format _character_ ;
%end;
  set d1 d2;
run;

PS You don't need to include a period when specifying the length of a variable. Variables can only have integer lengths.

Re: An efficient method to merge datasets with multiple length for the character variables

Tom — Thu, 16 Sep 2021 04:32:33 GMT

It would probably be better to create the original dataset with consistent lengths for the variables.

What is the source of those datasets, and why does the same variable end up with a different length in different datasets?

Re: An efficient method to merge datasets with multiple length for the character variables

Kurt_Bremser — Thu, 16 Sep 2021 07:06:01 GMT

Be careful with your wording. In SAS parlance, merge means putting data side-by-side. What you want is concatenating, stacking or appending datasets.

Re: An efficient method to merge datasets with multiple length for the character variables

Ksharp — Thu, 16 Sep 2021 12:34:34 GMT

The best way is using PROC SQL.

data d1;
	length var1 $10. var2 $10.;
	input var1 var2;
	cards;
apple carrots
orange cabbage
;

data d2;
	length var1 $30. var2 $30.;
	input var1 var2;
	datalines;
apple-pie-recipe  carrot-soup
orange-juice-for-breakfast cabbage-soup-for-supper
;

proc sql;
create table want as
select * from d1 
union all corr
select * from d2;
quit;

Re: An efficient method to merge datasets with multiple length for the character variables

Emma_at_SAS — Thu, 16 Sep 2021 12:41:49 GMT

Wow, thank you @Ksharp! I am glad I asked. It is much easier than changing the lengths one by one!

Am I right that

union all corr

does the job of increasing the length to the longer one automatically? Thanks!

Re: An efficient method to merge datasets with multiple length for the character variables

Emma_at_SAS — Thu, 16 Sep 2021 12:50:29 GMT

Thank you KurtBremser. I changed the title so that it is right for future users.

Re: An efficient method to merge datasets with multiple length for the character variables

Emma_at_SAS — Thu, 16 Sep 2021 12:52:21 GMT

I agree! The datasets were created from different sources and by different individuals. Then, when I imported them into SAS from SPSS/Excel, the lengths were assigned by SAS. Is there any way I can prevent this in the future? Thanks

Re: An efficient method to merge datasets with multiple length for the character variables

Ksharp — Thu, 16 Sep 2021 12:57:47 GMT

Correct!
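
One way to check this on your own data is to compare the stored lengths before and after the union. The following is a minimal sketch, assuming the tables D1, D2 and WANT from the posts above all live in the WORK library; it only reads the column metadata and changes nothing.

```sas
/* List the stored length of var1 and var2 in the inputs and in the result. */
/* Assumes D1, D2 and WANT exist in WORK, as in the examples above.         */
proc sql;
  select memname, name, type, length
    from dictionary.columns
    where libname = 'WORK'
      and memname in ('D1', 'D2', 'WANT')
      and upcase(name) in ('VAR1', 'VAR2')
    order by name, memname;
quit;
```

If the union kept the longer definition, the rows for WANT should report the same length as the rows for D2.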

Re: An efficient method to merge datasets with multiple length for the character variables

Emma_at_SAS — Thu, 16 Sep 2021 12:58:19 GMT

Thanks!

Re: An efficient method to merge datasets with multiple length for the character variables

Emma_at_SAS — Thu, 16 Sep 2021 13:22:55 GMT

@Ksharp May I please ask one more question. How do I modify the code to stack 5 datasets? I tried adding lines like

select * from d3;
select * from d4;
select * from d5;

but that did not work. Any hint on how to add the datasets all together instead of one at a time? Thanks!

Re: An efficient method to merge datasets with multiple length for the character variables

Ksharp — Thu, 16 Sep 2021 13:36:54 GMT

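Stack these datasets with the UNION operator, repeating it between each pair of SELECT statements and ending the whole query with a single semicolon.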

proc sql;
create table want as
select * from d1
union all corr
select * from d2
union all corr
select * from d3
union all corr
select * from d4
union all corr
select * from d5
;
quit;
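
If typing the chain by hand becomes tedious, the query text can probably be generated from the table metadata, in the same spirit as the dictionary.columns query earlier in the thread. This is only a sketch under assumptions: the macro variable name `selects` is made up for illustration, and the tables D1 to D5 are assumed to exist in WORK.

```sas
proc sql noprint;
  /* Build "select * from WORK.D1 union all corr select * from WORK.D2 ..." */
  /* The macro variable name SELECTS is hypothetical.                       */
  select catx(' ', 'select * from', cats('WORK.', memname))
    into :selects separated by ' union all corr '
    from dictionary.tables
    where libname = 'WORK'
      and memname in ('D1', 'D2', 'D3', 'D4', 'D5')
    order by memname;

  /* Run the generated query; the macro variable is already available here. */
  create table want as
  &selects;
quit;
```

The ORDER BY only fixes the order in which the tables are stacked. To stack a different set of tables, change the list in the WHERE clause.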

Re: An efficient method to merge datasets with multiple length for the character variables

Tom — Thu, 16 Sep 2021 13:59:44 GMT

PROC IMPORT of an SPSS file should replicate the structure in the SPSS file pretty well.

But PROC IMPORT of an EXCEL file cannot, since there is no concept of a variable in EXCEL. Every cell can be totally independent of every other cell.

If you are getting files from EXCEL you will have more control if the files are delivered as delimited text files. Then you can write your own data step in SAS to read the file so that you have full control over the type, length, name and other attributes of the variables.
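
A data step of that kind might look like the sketch below. The file path, the comma delimiter and the header row are all assumptions for illustration; the variable names and lengths are borrowed from the examples in this thread.

```sas
data d1;
  /* Declare the attributes first so SAS does not have to guess them. */
  length var1 $30 var2 $30 var3 8;
  /* Hypothetical path; assumes a comma-delimited file with one header row. */
  infile "/path/to/d1.csv" dsd dlm=',' firstobs=2 truncover;
  input var1 var2 var3;
run;
```

Using the same LENGTH statement in every such step gives each dataset identical definitions, so the later stacking step has nothing to reconcile.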

Re: An efficient method to merge datasets with multiple length for the character variables

Emma_at_SAS — Thu, 16 Sep 2021 14:33:00 GMT

@Ksharp I hope this is the last question. Each dataset also has variables that do not match (other than var1 and var2). How can I keep those in the stacked dataset? Thanks!!!

data d1;
	length var1 $10. var2 $10. var3 8;
	input var1 var2 var3;
	cards;
apple carrots 99
orange cabbage 88
;

data d2;
	length var1 $30. var2 $30. var4 8 var5 8;
	input var1 var2 var4 var5;
	datalines;
apple-pie-recipe  carrot-soup 100 111
orange-juice-for-breakfast cabbage-soup-for-supper 200 222
;

proc sql;
create table want as
select * from d1 
union all corr
select * from d2;
quit;

Re: An efficient method to merge datasets with multiple length for the character variables

Emma_at_SAS — Thu, 16 Sep 2021 16:53:04 GMT

I found the solution for the case when we have non-matching variables in each dataset:

proc sql;
create table want as
select * from d1 
OUTER UNION CORR
select * from d2;
quit;

Thanks for introducing this PROC SQL procedure. It was very helpful 🙂