Solved: How do I make an OUTER UNION CORRESPONDING JOIN not truncate data?

kbinan · Posted 04-10-2024 05:55 PM

I have 3 large datasets, each with normalized naming for 61 attributes, that I need to join into one. When I had tried a DATA STEP I realized it was truncating values based on the contents of the length of any give attribute in the first data set. After some research I stumbled on upon OUTER UNION CORRESPONDING that seemed to solve my problem.

However, I just uncovered a situation, this is just one example, where this evidently isn't working as I had expected. While the attribute naming is consistent the data within is not so when I join them together the truncating is still occurring. Do I have to continually evaluate all 61 attributes? is there a systematic way to produce this reliably every month?

These are large files that we create monthly (millions of records each) so I can't troubleshoot every nuance. How can I get around this to ensure that all attributes join with the full contents of their values? I'm using SAS EG 9.4. and this is a slightly urgent matter as I have a deadline on this.

My basic code is below and I've attached examples of the output. Thank you for any advise you can offer!

PROC SQL;
CREATE TABLE  work.ELIG_DATA AS

						   SELECT * FROM &CRM_FILE.

OUTER UNION CORRESPONDING  SELECT * FROM &UDW_FILE.  

OUTER UNION CORRESPONDING  SELECT * FROM &ELIG_DATA.
; 
QUIT;


/*WORK.ELIG_DATA created, with 7990414 rows and 61 columns.*/

Patrick · Posted 04-10-2024 08:19 PM

From what I understand the root of the problem is that your source tables don't have consistent attributes for the same named variables. Ideally that's the place where you fix the problem. If you can avoid it don't use Proc Import but use a SAS data step /infile/input as this will guarantee that your variables have always same attributes like length, format etc.

Using a outer union corr (or union corr all) set operation the length of the output variable will be the max length of the variable from any of the source tables. There won't be string truncation and I'm pretty sure you don't see any truncation warnings in your log.

The issue you encounter is: Other variable attributes like formats are taken from the first source table. The formats determine how your variable values get displayed. If a character format is shorter than the actual string then what gets printed is only the string up to the length of the format and though the string looks truncated (the internal value isn't).

The quick fix could be to just remove all formats from character variables. Below sample code for table work.want.

proc datasets lib=work memtype=data nolist;
  modify want;
    attrib _character_ format=;
  run;
quit;

Here some code to demonstrate what I'm talking about:

data work.have1;
  attrib var1 length=$5 format=$5.;
  var1='12345';
run;
data work.have2;
  attrib var1 length=$10;
  var1='1234567890';
run;
proc sql;
  create table work.want as
  select * from work.have1
  outer union corr
  select * from work.have2
  ;
quit;

proc contents data=work.want varnum;
run;
proc print data=work.want;
run;

The Proc Contents report shows you that the length has been taken from the 2nd source table but the format comes from the first source table

And because the format is $5 only the first 5 characters get printed (displayed)

If you remove the format then all characters in the string get printed.

proc datasets lib=work memtype=data nolist;
  modify want;
    attrib _character_ format=;
  run;
quit;
proc print data=work.want;
run;

If you can't already create the source tables with consistent variable attributes then the 2nd best option would be to pre-define a table structure so whatever variations there are in your source tables the target table gets always the same stable set of variables and attributes.

To do so you can create a zero row table with all the variables and then use this table as first source in your SQL or Proc Append or data step.

Here some sample code how to create and use such a zero row table with a SQL union.

data work.table_structure;
  attrib 
    var1 length=$20
    ;
  stop;
run;
proc sql;
  create table work.want as
  select * from work.table_structure
  outer union corr
  select * from work.have1
  outer union corr
  select * from work.have2
  ;
quit;
proc contents data=work.want;
run;

View solution in original post

ballardw · Posted 04-10-2024 06:26 PM

You haven't been very clear about explaining what is truncated. Which variable(s), from which source table(s), and which values.

Output without the input is often not sufficient.

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

Suggestion: Provide very small examples of each of the data sets that exhibit this behavior. Provide the data in the form of working data step code. And don't hide which data set is behind which macro variable.

It might help to show the log from running that join. Include the code and all the notes associated.

I am also not at all sure what you mean by attributes, much less "all 61" as I am not opening that spread sheet to see what an "attribute" may be. Since most SAS data sets have at most data length, type, label, format, informat as attributes I strongly suspect you mean something else.

kbinan · Posted 04-10-2024 06:53 PM

I did include an example of what's being truncated in the Excel file I attached. I can't share any further detail because it includes PHI/HIPAA medical claims data. The Excel example is just two fields in the 3 datasets, and the outputs of the result based on how the join is performed. I've included it below in SAS code form.

Attributes(columns/variables/elements/fields) that make up a full record on my dataset. And the '61' means there are 61 columns/attributes on the record.

The log would not show anything as the join runs without error; the issue is the results and the values in the columns being truncated depending on which order the join is performed in. I'm trying to find a way to ensure that the values in the columns are NOT truncated. I mention the 61 because the example I shared is just 2 instances of this happening; it is likely that it is occurring in various combinations across the 3 datasets.

I hope that provides some clarity on what I'm trying to achieve.

proc sql;
create table work.test_UDW as
select distinct 
		SRC_SYS_CD,
		STATUS
from &UDW_FILE.;
quit;
/*work.test_UDW	
	SRC_SYS_CD	STATUS
		UDW		Active
		UDW		UMR
*/

proc sql;
create table work.test_CRM as
select distinct 
		SRC_SYS_CD,
		STATUS
from &CRM_FILE.;
quit;

/*work.test_CRM	
	SRC_SYS_CD	STATUS
		CRM		Cross Carrier
*/


proc sql;
create table work.test_GLX as
select distinct 
		SRC_SYS_CD,
		STATUS
from &ELIG_DATA.;
quit;

/*work.test_GLX	
	SRC_SYS_CD	STATUS
		CE		Active
		CE		New
*/

proc sql;
create table work.test_ALL as
select distinct 
		SRC_SYS_CD,
		STATUS
from work.ELIG_DATA;
quit;

/*
Results when GLX is first in JOIN
SRC_SYS_CD	STATUS
	CE		Active
	CE		New
	CR		Cross Carrier
	UD		Active
	UD		UMR

Results when UDW is first in JOIN
SRC_SYS_CD	STATUS
	CE		Active
	CE		New
	CRM		Cross
	UDW		Active
	UDW		UMR
*/

Tom · Posted 04-10-2024 08:13 PM

I think you left out a step as I don't see any code that this statement could be applied to:

Results when GLX is first in JOIN

If you want share your example data then share it as a data step and not a listing. That way we know the variable types and lengths and whether there are any formats attached to the variables. It is trivial to convert your listings into a data step.

Example:

data udw ;
  infile datalines dsd truncover;
  input SRC_SYS_CD :$3.  STATUS :$6. ;
datalines;
UDW,Active
UDW,UMR
;

Patrick · Posted 04-10-2024 08:19 PM

From what I understand the root of the problem is that your source tables don't have consistent attributes for the same named variables. Ideally that's the place where you fix the problem. If you can avoid it don't use Proc Import but use a SAS data step /infile/input as this will guarantee that your variables have always same attributes like length, format etc.

Using a outer union corr (or union corr all) set operation the length of the output variable will be the max length of the variable from any of the source tables. There won't be string truncation and I'm pretty sure you don't see any truncation warnings in your log.

The issue you encounter is: Other variable attributes like formats are taken from the first source table. The formats determine how your variable values get displayed. If a character format is shorter than the actual string then what gets printed is only the string up to the length of the format and though the string looks truncated (the internal value isn't).

The quick fix could be to just remove all formats from character variables. Below sample code for table work.want.

proc datasets lib=work memtype=data nolist;
  modify want;
    attrib _character_ format=;
  run;
quit;

Here some code to demonstrate what I'm talking about:

data work.have1;
  attrib var1 length=$5 format=$5.;
  var1='12345';
run;
data work.have2;
  attrib var1 length=$10;
  var1='1234567890';
run;
proc sql;
  create table work.want as
  select * from work.have1
  outer union corr
  select * from work.have2
  ;
quit;

proc contents data=work.want varnum;
run;
proc print data=work.want;
run;

The Proc Contents report shows you that the length has been taken from the 2nd source table but the format comes from the first source table

And because the format is $5 only the first 5 characters get printed (displayed)

If you remove the format then all characters in the string get printed.

proc datasets lib=work memtype=data nolist;
  modify want;
    attrib _character_ format=;
  run;
quit;
proc print data=work.want;
run;

If you can't already create the source tables with consistent variable attributes then the 2nd best option would be to pre-define a table structure so whatever variations there are in your source tables the target table gets always the same stable set of variables and attributes.

To do so you can create a zero row table with all the variables and then use this table as first source in your SQL or Proc Append or data step.

Here some sample code how to create and use such a zero row table with a SQL union.

data work.table_structure;
  attrib 
    var1 length=$20
    ;
  stop;
run;
proc sql;
  create table work.want as
  select * from work.table_structure
  outer union corr
  select * from work.have1
  outer union corr
  select * from work.have2
  ;
quit;
proc contents data=work.want;
run;

kbinan · Posted 04-11-2024 02:13 AM

Thank you Patrick!! These are both excellent, but I really love the table_structure idea - brilliant!

How do I make an OUTER UNION CORRESPONDING JOIN not truncate data?

Re: How do I make an OUTER UNION CORRESPONDING JOIN not truncate data?

Re: How do I make an OUTER UNION CORRESPONDING JOIN not truncate data?

Re: How do I make an OUTER UNION CORRESPONDING JOIN not truncate data?

Re: How do I make an OUTER UNION CORRESPONDING JOIN not truncate data?

Re: How do I make an OUTER UNION CORRESPONDING JOIN not truncate data?

Re: How do I make an OUTER UNION CORRESPONDING JOIN not truncate data?

Classroom Training Available!