I have 3 large datasets, each with normalized naming for 61 attributes, that I need to join into one. When I had tried a DATA STEP I realized it was truncating values based on the contents of the length of any give attribute in the first data set. After some research I stumbled on upon OUTER UNION CORRESPONDING that seemed to solve my problem.
However, I just uncovered a situation, this is just one example, where this evidently isn't working as I had expected. While the attribute naming is consistent the data within is not so when I join them together the truncating is still occurring. Do I have to continually evaluate all 61 attributes? is there a systematic way to produce this reliably every month?
These are large files that we create monthly (millions of records each) so I can't troubleshoot every nuance. How can I get around this to ensure that all attributes join with the full contents of their values? I'm using SAS EG 9.4. and this is a slightly urgent matter as I have a deadline on this.
My basic code is below and I've attached examples of the output. Thank you for any advise you can offer!
PROC SQL;
CREATE TABLE work.ELIG_DATA AS
SELECT * FROM &CRM_FILE.
OUTER UNION CORRESPONDING SELECT * FROM &UDW_FILE.
OUTER UNION CORRESPONDING SELECT * FROM &ELIG_DATA.
;
QUIT;
/*WORK.ELIG_DATA created, with 7990414 rows and 61 columns.*/
From what I understand the root of the problem is that your source tables don't have consistent attributes for the same named variables. Ideally that's the place where you fix the problem. If you can avoid it don't use Proc Import but use a SAS data step /infile/input as this will guarantee that your variables have always same attributes like length, format etc.
Using a outer union corr (or union corr all) set operation the length of the output variable will be the max length of the variable from any of the source tables. There won't be string truncation and I'm pretty sure you don't see any truncation warnings in your log.
The issue you encounter is: Other variable attributes like formats are taken from the first source table. The formats determine how your variable values get displayed. If a character format is shorter than the actual string then what gets printed is only the string up to the length of the format and though the string looks truncated (the internal value isn't).
The quick fix could be to just remove all formats from character variables. Below sample code for table work.want.
proc datasets lib=work memtype=data nolist;
modify want;
attrib _character_ format=;
run;
quit;
Here some code to demonstrate what I'm talking about:
data work.have1;
attrib var1 length=$5 format=$5.;
var1='12345';
run;
data work.have2;
attrib var1 length=$10;
var1='1234567890';
run;
proc sql;
create table work.want as
select * from work.have1
outer union corr
select * from work.have2
;
quit;
proc contents data=work.want varnum;
run;
proc print data=work.want;
run;
The Proc Contents report shows you that the length has been taken from the 2nd source table but the format comes from the first source table
And because the format is $5 only the first 5 characters get printed (displayed)
If you remove the format then all characters in the string get printed.
proc datasets lib=work memtype=data nolist;
modify want;
attrib _character_ format=;
run;
quit;
proc print data=work.want;
run;
If you can't already create the source tables with consistent variable attributes then the 2nd best option would be to pre-define a table structure so whatever variations there are in your source tables the target table gets always the same stable set of variables and attributes.
To do so you can create a zero row table with all the variables and then use this table as first source in your SQL or Proc Append or data step.
Here some sample code how to create and use such a zero row table with a SQL union.
data work.table_structure;
attrib
var1 length=$20
;
stop;
run;
proc sql;
create table work.want as
select * from work.table_structure
outer union corr
select * from work.have1
outer union corr
select * from work.have2
;
quit;
proc contents data=work.want;
run;
You haven't been very clear about explaining what is truncated. Which variable(s), from which source table(s), and which values.
Output without the input is often not sufficient.
Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.
Suggestion: Provide very small examples of each of the data sets that exhibit this behavior. Provide the data in the form of working data step code. And don't hide which data set is behind which macro variable.
It might help to show the log from running that join. Include the code and all the notes associated.
I am also not at all sure what you mean by attributes, much less "all 61" as I am not opening that spread sheet to see what an "attribute" may be. Since most SAS data sets have at most data length, type, label, format, informat as attributes I strongly suspect you mean something else.
I did include an example of what's being truncated in the Excel file I attached. I can't share any further detail because it includes PHI/HIPAA medical claims data. The Excel example is just two fields in the 3 datasets, and the outputs of the result based on how the join is performed. I've included it below in SAS code form.
Attributes(columns/variables/elements/fields) that make up a full record on my dataset. And the '61' means there are 61 columns/attributes on the record.
The log would not show anything as the join runs without error; the issue is the results and the values in the columns being truncated depending on which order the join is performed in. I'm trying to find a way to ensure that the values in the columns are NOT truncated. I mention the 61 because the example I shared is just 2 instances of this happening; it is likely that it is occurring in various combinations across the 3 datasets.
I hope that provides some clarity on what I'm trying to achieve.
proc sql;
create table work.test_UDW as
select distinct
SRC_SYS_CD,
STATUS
from &UDW_FILE.;
quit;
/*work.test_UDW
SRC_SYS_CD STATUS
UDW Active
UDW UMR
*/
proc sql;
create table work.test_CRM as
select distinct
SRC_SYS_CD,
STATUS
from &CRM_FILE.;
quit;
/*work.test_CRM
SRC_SYS_CD STATUS
CRM Cross Carrier
*/
proc sql;
create table work.test_GLX as
select distinct
SRC_SYS_CD,
STATUS
from &ELIG_DATA.;
quit;
/*work.test_GLX
SRC_SYS_CD STATUS
CE Active
CE New
*/
proc sql;
create table work.test_ALL as
select distinct
SRC_SYS_CD,
STATUS
from work.ELIG_DATA;
quit;
/*
Results when GLX is first in JOIN
SRC_SYS_CD STATUS
CE Active
CE New
CR Cross Carrier
UD Active
UD UMR
Results when UDW is first in JOIN
SRC_SYS_CD STATUS
CE Active
CE New
CRM Cross
UDW Active
UDW UMR
*/
I think you left out a step as I don't see any code that this statement could be applied to:
Results when GLX is first in JOIN
If you want share your example data then share it as a data step and not a listing. That way we know the variable types and lengths and whether there are any formats attached to the variables. It is trivial to convert your listings into a data step.
Example:
data udw ;
infile datalines dsd truncover;
input SRC_SYS_CD :$3. STATUS :$6. ;
datalines;
UDW,Active
UDW,UMR
;
From what I understand the root of the problem is that your source tables don't have consistent attributes for the same named variables. Ideally that's the place where you fix the problem. If you can avoid it don't use Proc Import but use a SAS data step /infile/input as this will guarantee that your variables have always same attributes like length, format etc.
Using a outer union corr (or union corr all) set operation the length of the output variable will be the max length of the variable from any of the source tables. There won't be string truncation and I'm pretty sure you don't see any truncation warnings in your log.
The issue you encounter is: Other variable attributes like formats are taken from the first source table. The formats determine how your variable values get displayed. If a character format is shorter than the actual string then what gets printed is only the string up to the length of the format and though the string looks truncated (the internal value isn't).
The quick fix could be to just remove all formats from character variables. Below sample code for table work.want.
proc datasets lib=work memtype=data nolist;
modify want;
attrib _character_ format=;
run;
quit;
Here some code to demonstrate what I'm talking about:
data work.have1;
attrib var1 length=$5 format=$5.;
var1='12345';
run;
data work.have2;
attrib var1 length=$10;
var1='1234567890';
run;
proc sql;
create table work.want as
select * from work.have1
outer union corr
select * from work.have2
;
quit;
proc contents data=work.want varnum;
run;
proc print data=work.want;
run;
The Proc Contents report shows you that the length has been taken from the 2nd source table but the format comes from the first source table
And because the format is $5 only the first 5 characters get printed (displayed)
If you remove the format then all characters in the string get printed.
proc datasets lib=work memtype=data nolist;
modify want;
attrib _character_ format=;
run;
quit;
proc print data=work.want;
run;
If you can't already create the source tables with consistent variable attributes then the 2nd best option would be to pre-define a table structure so whatever variations there are in your source tables the target table gets always the same stable set of variables and attributes.
To do so you can create a zero row table with all the variables and then use this table as first source in your SQL or Proc Append or data step.
Here some sample code how to create and use such a zero row table with a SQL union.
data work.table_structure;
attrib
var1 length=$20
;
stop;
run;
proc sql;
create table work.want as
select * from work.table_structure
outer union corr
select * from work.have1
outer union corr
select * from work.have2
;
quit;
proc contents data=work.want;
run;
Thank you Patrick!! These are both excellent, but I really love the table_structure idea - brilliant!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.