I have 8 data sets to merge and each has 2.000 variables.
Of these 2.000 variables I have a lot of string variables and they are not the same length.
I am getting truncated string variables in the merged file.
Someone told me to list the file that has the longest strings first on the merge statement. However, I don't have any one file that has all the string variables with the longest lengths. One file might have the longest strings for a few variables but shortest for others.
How can I change all of the string lengths to be the same (and the longest possible) before I merge these files?
Thank you.
Hello @Mscarboncopy,
Here's an example of how you could do it:
/* Create test datasets for demonstration */
data have1; length id 4 a $1 b $2 c $3; id=1; a='f'; b='gh'; c='ijk'; run;
data have2; length id 5 a $2 b $3 c $1; id=2; a='lm'; b='nop'; c='q'; d=1; run;
data have3; length id 6 a $3 b $1 c $2; id=3; a='stu'; b='v'; c='wx'; run;
/* Create an empty template dataset with maximum length for each character variable
and length 8 for each numeric variable */
proc sql;
create table tmp as
select * from have1 where 0
outer union corr
select * from have2 where 0
outer union corr
select * from have3 where 0;
quit;
/* Merge datasets after setting up the PDV with the template dataset */
data want;
if 0 then set tmp;
merge have1-have3;
by id;
run;
Important edit: Changed UNION to OUTER UNION CORR (and removed nowarn option) in the PROC SQL step. Reason: Without "CORR(ESPONDING)" the columns would be aligned by position, not by variable name! The UNION operator with the CORR option would shed the "leftover" column D from dataset HAVE2, whereas the OUTER UNION operator includes it.
Hello @Mscarboncopy,
Here's an example of how you could do it:
/* Create test datasets for demonstration */
data have1; length id 4 a $1 b $2 c $3; id=1; a='f'; b='gh'; c='ijk'; run;
data have2; length id 5 a $2 b $3 c $1; id=2; a='lm'; b='nop'; c='q'; d=1; run;
data have3; length id 6 a $3 b $1 c $2; id=3; a='stu'; b='v'; c='wx'; run;
/* Create an empty template dataset with maximum length for each character variable
and length 8 for each numeric variable */
proc sql;
create table tmp as
select * from have1 where 0
outer union corr
select * from have2 where 0
outer union corr
select * from have3 where 0;
quit;
/* Merge datasets after setting up the PDV with the template dataset */
data want;
if 0 then set tmp;
merge have1-have3;
by id;
run;
Important edit: Changed UNION to OUTER UNION CORR (and removed nowarn option) in the PROC SQL step. Reason: Without "CORR(ESPONDING)" the columns would be aligned by position, not by variable name! The UNION operator with the CORR option would shed the "leftover" column D from dataset HAVE2, whereas the OUTER UNION operator includes it.
Does the union automatically select the maximum length? Pretty neat!
Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF
View now: on-demand content for SAS users
@noling wrote:
Does the union automatically select the maximum length? Pretty neat!
Yes, this is true, also for the OUTER UNION operator, which I should have used in conjunction with the CORR option instead of UNION (without CORR) to make the code much more robust (see updated post). Thank you for making me rethink my suggestion!
Thank you. I will try this out and let you know if it worked.
It looks simple enough and it is what I was hoping to get.
It works. It is pretty amazing. I have a question though, when I look at the merged file I see that the string lengths are not set to one max length but the max for each variable for the files merged (probably), which makes sense since it is all that matters, to set to the max needed based on what you have.
Thank you again. This was exactly what I needed.
@Mscarboncopy wrote:
It works. It is pretty amazing. I have a question though, when I look at the merged file I see that the string lengths are not set to one max length but the max for each variable for the files merged (probably), which makes sense since it is all that matters, to set to the max needed based on what you have.
Glad to hear that my suggestion worked for you. Yes, with the CORR (or CORRESPONDING) option the OUTER UNION operator aligns like-named columns. So, in particular each of the character variables in question is handled separately. The creation of table TMP requires that variable, say, CVAR1 in TMP has a defined length. To avoid truncation, PROC SQL uses the maximum of the lengths of CVAR1 found in datasets HAVE1, HAVE2 and HAVE3 (but it is not necessary that all three datasets contain CVAR1). This is a major difference to the analogous DATA step statement "set have1-have3;", which would use the length found in the first dataset containing CVAR1 and only throw a warning ("Multiple lengths were specified ...") in the log if CVAR1 in a later dataset had a greater length.
What OUTER UNION CORR does take from the first possible dataset (like the DATA step SET statement) are formats, informats and labels. (Clearly, there are no obvious ideal choices for these.) In situations where even variable lengths are inconsistent I've often seen that those other variable attributes are inconsistent as well. Therefore, I'd recommend that you ensure that the formats, informats and labels (if any) of the aligned variables in dataset WANT are appropriate. Note that a format such as $w. with w less than the length of the variable could lead to truncated values in a report, even though the full-length string is actually available in the variable. For example, if variable B in my dataset HAVE1 had format $2. the value 'nop' from HAVE2 would be displayed as 'no' in PROC PRINT output of dataset WANT.
Honestly, I don't think I've ever used my suggested technique (or have seen it elsewhere) in practice to harmonize variable lengths. I just vaguely remembered that PROC SQL behaves better than the DATA step when it comes to aligning variables with different lengths. So, to create the "template dataset" TMP I had envisioned, I needed to select a suitable PROC SQL technique for combining the "HAVE" datasets and first settled for the UNION operator. Thanks to @noling's question I noticed that this (without the CORR option) could have led to error messages (if numeric and character variables had the same position in different datasets) and, even worse, to truncation of character values (if different character variables had the same position; the final DATA step would have issued a warning, though). That's how I finally ended up with OUTER UNION CORR.
Utilizing these test datasets:
DATA TESTDATA_1;
FORMAT ID 8. DATE DATE9. Code $7.;
INFORMAT ID 8. DATE DATE9. Code $7.;
INPUT ID DATE Code;
CARDS;
100 13MAY2019 HCC-5
100 13MAY2019 HCC-9
100 13MAY2019 HCC-36
101 13MAY2019 HCC-4
101 13MAY2019 HCC-101
;
DATA TESTDATA_2;
FORMAT ID 8. DATE DATE9. Code $5.;
INFORMAT ID 8. DATE DATE9. Code $5.;
INPUT ID DATE Code;
CARDS;
100 14JUN2019 HCC-5
100 14JUN2019 HCC-9
101 14JUN2019 HCC-4
;
I created two tables that contained the same fields but the "Code" variable is different lengths/formats.
Then I combined the two tables so that I could see the values get truncated.
DATA WORK.Combine_PreFix;
SET WORK.TESTDATA_2
WORK.TESTDATA_1;
RUN;
Name Type Length Format Informat
ID Numeric 8 8. 8.
DATE Date 8 DATE9. DATE9.
Code Character 5 $5. $5.
A warning message was generated:
WARNING: Multiple lengths were specified for the variable Code by input data set(s). This can cause truncation of data.
And this table was the result (where we can clearly see the values were truncated):
ID DATE Code
100 14JUN2019 HCC-5
100 14JUN2019 HCC-9
101 14JUN2019 HCC-4
100 13MAY2019 HCC-5
100 13MAY2019 HCC-9
100 13MAY2019 HCC-3
101 13MAY2019 HCC-4
101 13MAY2019 HCC-1
So I put together this logic, to identify the max(length) for every value and their formats, and then put them into a macro containing a running list for all of the variables on the tables.
PROC SQL NOPRINT;
SELECT DISTINCT
CAT(STRIP(a.Name), ' ', STRIP(CASE b.TYPE WHEN 'num' THEN ' ' WHEN 'char' THEN '$' ELSE ' ' END), a.Mlen)
, CAT(STRIP(a.Name), ' ', b.Format)
, a.VarNum
INTO :LngthDtls SEPARATED BY ' '
, :FrmtDtls SEPARATED BY ' '
, :Varnum
FROM (SELECT Libname, VarNum, Name, MAX(Length) AS Mlen
FROM SASHELP.VCOLUMN
WHERE LIBNAME='WORK'
AND MemName IN("TESTDATA_1", "TESTDATA_2")
GROUP BY Libname, Name, VarNum) AS a
INNER JOIN SASHELP.VCOLUMN AS b ON a.LIBNAME=b.LIBNAME
AND a.Name=b.Name
AND a.Mlen=b.Length
WHERE b.LIBNAME='WORK'
AND b.MemName IN("TESTDATA_1", "TESTDATA_2")
ORDER BY a.VarNum;
QUIT;
*I only include the VarNum details, in an attempt to retain the variable order as they currently appear in the tables reviewed).
To test the newly created macro variables:
%PUT &=FrmtDtls.;
%PUT &=LngthDtls.;
FRMTDTLS=ID 8. DATE DATE9. Code $7.
LNGTHDTLS=ID 8 DATE 8 Code $7
*(as per the log).
Then we use this simple datastep to set the details (using the macros we just generated for all variables involved):
DATA WORK.Combine_PostFix;
LENGTH &LNGTHDtls.;
FORMAT &FrmtDtls.;
SET WORK.TESTDATA_2
WORK.TESTDATA_1;
RUN;
Which produces no warning message, and this resulting output table:
ID DATE Code
100 14JUN2019 HCC-5
100 14JUN2019 HCC-9
101 14JUN2019 HCC-4
100 13MAY2019 HCC-5
100 13MAY2019 HCC-9
100 13MAY2019 HCC-36
101 13MAY2019 HCC-4
101 13MAY2019 HCC-101
Name Type Length Format Informat
ID Numeric 8 8. 8.
DATE Date 8 DATE9. DATE9.
Code Character 7 $7. $7.
And now we can clearly see that none of the values are being truncated.
For the purposes of your process, you will need to populated the libname and memname values in the where criteria locations with your specific libname and table names. Personally I would wrap this logic into a Macro statement with some positional values. And if the tables being used sometime fluctuates, I would use additional logic to generate a flexible macro list of all the table names that should be included (depending on daily/weekly/monthly criteria, etc). I didn't want to over complicate the answer that I provide with all of that information.
Hope this helps.
Thank you. I will be trying this out.
Is there a reason why you have date variables ?
It thew me off a little.
The code would be easier with only "id" and "code".
I don't have to do anything with my date variables just the strings. Could I just remove the lines with the date statements?
There were no specific details in your initial post about the variable characteristics between the tables you are combining, so I included a Date formatted field.
Yes, you can remove the lines where it references dates. I will say, the actual logic that you would use doesn't include any references to 'date'. The only places where you will see 'date' is when I am creating the two tables with sample test data, and the output table/details.
The benefit to the code I provided is that it doesn't matter what format the fields are, it will identify them appropriately and apply the necessary details when combining your own datasets.
Let me know if that doesn't make sense.
Thank you.
That would be ideal but I am not sure how to do that.
I usually have spss files with the different lengths (I can't change that, our program generates these files that way), I save them in sas to clean and merge them in sas.
How could I use PROC IMPORT in that case ?
I have never done that.
Thank you.
Perhaps you can run with this:
data a;
length a1-a2000 $10;
call missing(of _all_);
run;
data b;
length a1-a2000 $50;
call missing(of _all_);
run;
data c;
length a1-a2000 $20;
call missing(of _all_);
run;
proc sql;
create table columns as
select name,max(length) as max_length
from dictionary.columns
where upcase(libname)='WORK' and upcase(memname) in ('A','B','C') and type='char'
group by name
;
quit;
%macro code;
%let name=%trim(&name);
%let length=%trim(&max_length);
attrib &name length=$&length;
%mend;
data test;
length in_ds $32;
* set PDV order ;
format a1-a2000;
* set PDV attributes ;
%loop_control(control=columns)
set
a
b
c
indsname=dsname
;
in_ds=dsname;
run;
https://github.com/scottbass/SAS/blob/master/Macro/loop_control.sas
https://github.com/scottbass/SAS/blob/master/Macro/parmv.sas
Using an OUTER UNION CORR as @FreelanceReinh proposes is such a simple approach for "length alignment".
Why would you want to replace that with any other coding option if there aren't additional requirements you need to address?
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.