Solved: How to ensure like variables have the same length before combining dat...

Mscarboncopy · Posted 05-20-2019 11:21 AM

I have 8 data sets to merge and each has 2.000 variables.

Of these 2.000 variables I have a lot of string variables and they are not the same length.

I am getting truncated string variables in the merged file.

Someone told me to list the file that has the longest strings first on the merge statement. However, I don't have any one file that has all the string variables with the longest lengths. One file might have the longest strings for a few variables but shortest for others.

How can I change all of the string lengths to be the same (and the longest possible) before I merge these files?

Thank you.

FreelanceReinh · Posted 05-20-2019 12:04 PM

Hello @Mscarboncopy,

Here's an example of how you could do it:

/* Create test datasets for demonstration */

data have1; length id 4 a $1 b $2 c $3; id=1; a='f';   b='gh';  c='ijk';    run;
data have2; length id 5 a $2 b $3 c $1; id=2; a='lm';  b='nop'; c='q'; d=1; run;
data have3; length id 6 a $3 b $1 c $2; id=3; a='stu'; b='v';   c='wx';     run;

/* Create an empty template dataset with maximum length for each character variable
   and length 8 for each numeric variable */

proc sql;
create table tmp as
select * from have1 where 0
outer union corr
select * from have2 where 0
outer union corr
select * from have3 where 0;
quit;

/* Merge datasets after setting up the PDV with the template dataset */

data want;
if 0 then set tmp;
merge have1-have3;
by id;
run;

Important edit: Changed UNION to OUTER UNION CORR (and removed nowarn option) in the PROC SQL step. Reason: Without "CORR(ESPONDING)" the columns would be aligned by position, not by variable name! The UNION operator with the CORR option would shed the "leftover" column D from dataset HAVE2, whereas the OUTER UNION operator includes it.

View solution in original post

FreelanceReinh · Posted 05-20-2019 12:04 PM

Hello @Mscarboncopy,

Here's an example of how you could do it:

/* Create test datasets for demonstration */

data have1; length id 4 a $1 b $2 c $3; id=1; a='f';   b='gh';  c='ijk';    run;
data have2; length id 5 a $2 b $3 c $1; id=2; a='lm';  b='nop'; c='q'; d=1; run;
data have3; length id 6 a $3 b $1 c $2; id=3; a='stu'; b='v';   c='wx';     run;

/* Create an empty template dataset with maximum length for each character variable
   and length 8 for each numeric variable */

proc sql;
create table tmp as
select * from have1 where 0
outer union corr
select * from have2 where 0
outer union corr
select * from have3 where 0;
quit;

/* Merge datasets after setting up the PDV with the template dataset */

data want;
if 0 then set tmp;
merge have1-have3;
by id;
run;

Important edit: Changed UNION to OUTER UNION CORR (and removed nowarn option) in the PROC SQL step. Reason: Without "CORR(ESPONDING)" the columns would be aligned by position, not by variable name! The UNION operator with the CORR option would shed the "leftover" column D from dataset HAVE2, whereas the OUTER UNION operator includes it.

noling · Posted 05-20-2019 02:18 PM

Does the union automatically select the maximum length? Pretty neat!

Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

FreelanceReinh · Posted 05-20-2019 02:59 PM

@noling wrote:

Does the union automatically select the maximum length? Pretty neat!

Yes, this is true, also for the OUTER UNION operator, which I should have used in conjunction with the CORR option instead of UNION (without CORR) to make the code much more robust (see updated post). Thank you for making me rethink my suggestion!

Mscarboncopy · Posted 05-21-2019 08:56 AM

Thank you. I will try this out and let you know if it worked.

It looks simple enough and it is what I was hoping to get.

Mscarboncopy · Posted 05-21-2019 09:22 AM

It works. It is pretty amazing. I have a question though, when I look at the merged file I see that the string lengths are not set to one max length but the max for each variable for the files merged (probably), which makes sense since it is all that matters, to set to the max needed based on what you have.

Thank you again. This was exactly what I needed.

FreelanceReinh · Posted 05-21-2019 01:45 PM

@Mscarboncopy wrote:

It works. It is pretty amazing. I have a question though, when I look at the merged file I see that the string lengths are not set to one max length but the max for each variable for the files merged (probably), which makes sense since it is all that matters, to set to the max needed based on what you have.

Glad to hear that my suggestion worked for you. Yes, with the CORR (or CORRESPONDING) option the OUTER UNION operator aligns like-named columns. So, in particular each of the character variables in question is handled separately. The creation of table TMP requires that variable, say, CVAR1 in TMP has a defined length. To avoid truncation, PROC SQL uses the maximum of the lengths of CVAR1 found in datasets HAVE1, HAVE2 and HAVE3 (but it is not necessary that all three datasets contain CVAR1). This is a major difference to the analogous DATA step statement "set have1-have3;", which would use the length found in the first dataset containing CVAR1 and only throw a warning ("Multiple lengths were specified ...") in the log if CVAR1 in a later dataset had a greater length.

What OUTER UNION CORR does take from the first possible dataset (like the DATA step SET statement) are formats, informats and labels. (Clearly, there are no obvious ideal choices for these.) In situations where even variable lengths are inconsistent I've often seen that those other variable attributes are inconsistent as well. Therefore, I'd recommend that you ensure that the formats, informats and labels (if any) of the aligned variables in dataset WANT are appropriate. Note that a format such as $w. with w less than the length of the variable could lead to truncated values in a report, even though the full-length string is actually available in the variable. For example, if variable B in my dataset HAVE1 had format $2. the value 'nop' from HAVE2 would be displayed as 'no' in PROC PRINT output of dataset WANT.

Honestly, I don't think I've ever used my suggested technique (or have seen it elsewhere) in practice to harmonize variable lengths. I just vaguely remembered that PROC SQL behaves better than the DATA step when it comes to aligning variables with different lengths. So, to create the "template dataset" TMP I had envisioned, I needed to select a suitable PROC SQL technique for combining the "HAVE" datasets and first settled for the UNION operator. Thanks to @noling's question I noticed that this (without the CORR option) could have led to error messages (if numeric and character variables had the same position in different datasets) and, even worse, to truncation of character values (if different character variables had the same position; the final DATA step would have issued a warning, though). That's how I finally ended up with OUTER UNION CORR.

tsap · Posted 05-20-2019 04:05 PM

Utilizing these test datasets:

DATA TESTDATA_1;
FORMAT 		ID 8. DATE DATE9. Code $7.;
INFORMAT	ID 8. DATE DATE9. Code $7.; 
INPUT  		ID    DATE 	   Code; 
CARDS;
100	13MAY2019	HCC-5
100	13MAY2019	HCC-9
100	13MAY2019	HCC-36
101	13MAY2019	HCC-4
101	13MAY2019	HCC-101
;

DATA TESTDATA_2;
FORMAT 		ID 8. DATE DATE9. Code $5.;
INFORMAT	ID 8. DATE DATE9. Code $5.; 
INPUT  		ID    DATE 	   Code; 
CARDS;
100	14JUN2019	HCC-5
100	14JUN2019	HCC-9
101	14JUN2019	HCC-4
;

I created two tables that contained the same fields but the "Code" variable is different lengths/formats.

Then I combined the two tables so that I could see the values get truncated.

DATA WORK.Combine_PreFix;
	SET WORK.TESTDATA_2
	    WORK.TESTDATA_1;
RUN;

Name	Type	        Length	  Format	Informat
ID	Numeric	        8	  8.	        8.	
DATE	Date	        8	  DATE9.	DATE9.	
Code	Character	5	  $5.	        $5.

A warning message was generated:

WARNING: Multiple lengths were specified for the variable Code by input data set(s). This can cause truncation of data.

And this table was the result (where we can clearly see the values were truncated):

ID	DATE	        Code
100	14JUN2019	HCC-5
100	14JUN2019	HCC-9
101	14JUN2019	HCC-4
100	13MAY2019	HCC-5
100	13MAY2019	HCC-9
100	13MAY2019	HCC-3
101	13MAY2019	HCC-4
101	13MAY2019	HCC-1

So I put together this logic, to identify the max(length) for every value and their formats, and then put them into a macro containing a running list for all of the variables on the tables.

PROC SQL NOPRINT;
	SELECT DISTINCT
		  CAT(STRIP(a.Name), ' ', STRIP(CASE b.TYPE WHEN 'num' THEN ' '	 WHEN 'char' THEN '$'  ELSE ' ' END), a.Mlen)
		, CAT(STRIP(a.Name), ' ', b.Format)
		, a.VarNum
	INTO  :LngthDtls SEPARATED BY ' '
		, :FrmtDtls  SEPARATED BY ' '
		, :Varnum
	FROM 		(SELECT Libname, VarNum, Name, MAX(Length) AS Mlen
				 FROM SASHELP.VCOLUMN	
				 WHERE LIBNAME='WORK'
				 AND MemName IN("TESTDATA_1", "TESTDATA_2")
				 GROUP BY Libname, Name, VarNum)	AS a		
	INNER JOIN	SASHELP.VCOLUMN						AS b	ON  a.LIBNAME=b.LIBNAME
															AND a.Name=b.Name
				 											AND a.Mlen=b.Length
	WHERE b.LIBNAME='WORK'
	AND b.MemName IN("TESTDATA_1", "TESTDATA_2")
	ORDER BY a.VarNum;
QUIT;

*I only include the VarNum details, in an attempt to retain the variable order as they currently appear in the tables reviewed).

To test the newly created macro variables:

%PUT &=FrmtDtls.;
%PUT &=LngthDtls.;

FRMTDTLS=ID 8. DATE DATE9. Code $7.

LNGTHDTLS=ID 8 DATE 8 Code $7

*(as per the log).

Then we use this simple datastep to set the details (using the macros we just generated for all variables involved):

DATA WORK.Combine_PostFix;
	LENGTH &LNGTHDtls.;
	FORMAT &FrmtDtls.;
	SET WORK.TESTDATA_2
		WORK.TESTDATA_1;
RUN;

Which produces no warning message, and this resulting output table:

ID	DATE	       Code
100	14JUN2019	HCC-5
100	14JUN2019	HCC-9
101	14JUN2019	HCC-4
100	13MAY2019	HCC-5
100	13MAY2019	HCC-9
100	13MAY2019	HCC-36
101	13MAY2019	HCC-4
101	13MAY2019	HCC-101

Name	Type	     Length	Format	  Informat
ID	Numeric	     8	        8.	  8.	
DATE	Date	     8	        DATE9.	  DATE9.	
Code	Character    7	        $7.	  $7.

And now we can clearly see that none of the values are being truncated.

For the purposes of your process, you will need to populated the libname and memname values in the where criteria locations with your specific libname and table names. Personally I would wrap this logic into a Macro statement with some positional values. And if the tables being used sometime fluctuates, I would use additional logic to generate a flexible macro list of all the table names that should be included (depending on daily/weekly/monthly criteria, etc). I didn't want to over complicate the answer that I provide with all of that information.

Hope this helps.

Mscarboncopy · Posted 05-21-2019 08:54 AM

Thank you. I will be trying this out.

Is there a reason why you have date variables ?

It thew me off a little.

The code would be easier with only "id" and "code".

I don't have to do anything with my date variables just the strings. Could I just remove the lines with the date statements?

tsap · Posted 05-21-2019 10:31 AM

There were no specific details in your initial post about the variable characteristics between the tables you are combining, so I included a Date formatted field.

Yes, you can remove the lines where it references dates. I will say, the actual logic that you would use doesn't include any references to 'date'. The only places where you will see 'date' is when I am creating the two tables with sample test data, and the output table/details.

The benefit to the code I provided is that it doesn't matter what format the fields are, it will identify them appropriately and apply the necessary details when combining your own datasets.

Let me know if that doesn't make sense.

Mscarboncopy · Posted 05-21-2019 10:53 AM

Thank you.

Reeza · Posted 05-20-2019 07:33 PM

Can you go back and control that at the data import stage. For example, this is a common occurrence when you've used PROC IMPORT instead of writing a data step input, especially if the files have the same structure. In general, this is the best approach and why it's worth spending time up front ensuring your data is read in correctly.

Mscarboncopy · Posted 05-21-2019 08:50 AM

That would be ideal but I am not sure how to do that.

I usually have spss files with the different lengths (I can't change that, our program generates these files that way), I save them in sas to clean and merge them in sas.

How could I use PROC IMPORT in that case ?

I have never done that.

Thank you.

ScottBass · Posted 05-21-2019 12:15 AM

Perhaps you can run with this:

data a;
   length a1-a2000 $10;
   call missing(of _all_);
run;
data b;
   length a1-a2000 $50;
   call missing(of _all_);
run;
data c;
   length a1-a2000 $20;
   call missing(of _all_);
run;

proc sql;
   create table columns as
   select name,max(length) as max_length
   from dictionary.columns
   where upcase(libname)='WORK' and upcase(memname) in ('A','B','C') and type='char'
   group by name
   ;
quit;

%macro code;
%let name=%trim(&name);
%let length=%trim(&max_length);
attrib &name length=$&length;
%mend;

data test;
   length in_ds $32;
   * set PDV order ;
   format a1-a2000;
   * set PDV attributes ;
   %loop_control(control=columns)
   set 
      a
      b
      c
      indsname=dsname
   ;
   in_ds=dsname;
run;

https://github.com/scottbass/SAS/blob/master/Macro/loop_control.sas

https://github.com/scottbass/SAS/blob/master/Macro/parmv.sas

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

Patrick · Posted 05-21-2019 08:56 AM

@ScottBass

Using an OUTER UNION CORR as @FreelanceReinh proposes is such a simple approach for "length alignment".

Why would you want to replace that with any other coding option if there aren't additional requirements you need to address?

How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Re: How to ensure like variables have the same length before combining data sets

Registration is open

SAS Training: Just a Click Away