BookmarkSubscribeRSS Feed
MG18
Lapis Lazuli | Level 10

LIBNAME TAGETLIB DB2 Datasrc=ABDlive SCHEMA=SASDB  AUTHDOMAIN="SASDB_AUTH" ;

 

data source_rep_data;
set tagetlib.src_cons_cibil_table_postval;
/*if datepart(report_date) = '30APR2015'd;*/
run;

PROC SORT data=source_rep_data out=src_temp;
BY account_id int_account_id ownership_indicator;
RUN;
PROC SORT data=sascibil.cibil_consumer_name_add out=modif_temp;
BY account_id int_account_id ownership_indicator;
RUN;

 

 

DATA whole_dat_after_format1;
MERGE src_temp (in=a) modif_temp(in=b);
BY account_id int_account_id ownership_indicator;
IF a=1 ;
RUN;

 

 

I am running above mentioned code and in log it is written below message :-

WARNING: Multiple lengths were specified for the BY variable ACCOUNT_ID by input data sets. This may cause unexpected results.
WARNING: Multiple lengths were specified for the BY variable INT_ACCOUNT_ID by input data sets. This may cause unexpected results.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 1308561 observations read from the data set WORK.SRC_TEMP.
NOTE: There were 2907358 observations read from the data set WORK.MODIF_TEMP.
NOTE: The data set WORK.WHOLE_DAT_AFTER_FORMAT1 has 2907358 observations and 166 variables.

even though i have written  if and in statement why it is showing "NOTE: MERGE statement has more than one data set with repeats of BY values." ?

 

and SRC_TEMP has 13 lakh record with above code should return the same but in actual it is returning 29 lakhs record why it is so ?

it is due to warning ??

Please help in above code to understand and tell me y it is returning 29 lakhs record not the 13 lakhs records. 

 

 

 

5 REPLIES 5
Rhys
Obsidian | Level 7

You have a Cartesian join, de dupe your source tables, with nodupkey or a proc summary.

Shmuel
Garnet | Level 18

Check for which combination(s) of BY keys you have more than one observation in each of the both datasets.

 

You can do it by next code:

data dup1;
 set src_temp;
       by BY account_id int_account_id ownership_indicator;
            if not (first.ownership_indicator and last.ownership_indicator);
run;

data dup2;
 set modif_temp;
       by BY account_id int_account_id ownership_indicator;
            if not (first.ownership_indicator and last.ownership_indicator);
run;

proc sql;
        select distinct account_id, int_account_id, ownership_indicator
         from dup1 as a
         inner join dup2 as b
         on 
            a.account_id = b.account_id
            a.int_account_id =b.int_account_id
            a.ownership_indicator = b.ownership_indicator 
; quit;
Kurt_Bremser
Super User

The fact that the output has the same 2.9 million records that modif_temp has lets me suspect that you have at least one match in src_temp for every observation in modif_temp.

Very quick example:

data one;
x = 1;
run;

data two;
x = 1;
output;
output;
run;

data result;
merge
  one (in=a)
  two (in=b)
;
by x;
if a;
run;

proc print data=result noobs;
run;

Log of the merge:

34         data result;
35         merge
36           one (in=a)
37           two (in=b)
38         ;
39         by x;
40         if a;
41         run;

NOTE: There were 1 observations read from the data set WORK.ONE.
NOTE: There were 2 observations read from the data set WORK.TWO.
NOTE: The data set WORK.RESULT has 2 observations and 1 variables.
Kurt_Bremser
Super User

And you need to clean up your data, so that these messages vanish:

WARNING: Multiple lengths were specified for the BY variable ACCOUNT_ID by input data sets. This may cause unexpected results.
WARNING: Multiple lengths were specified for the BY variable INT_ACCOUNT_ID by input data sets. This may cause unexpected results.
NOTE: MERGE statement has more than one data set with repeats of BY values.

All these indicate conditions that can (and usually will) make the merge return incorrect results.

 

Maxim 3: Know your data.

Astounding
PROC Star

The WARNING can cause the NOTE.

 

The WARNING indicates that one (or more) of your BY variables is being truncated when executing the DATA step.  That can cause the data to appear like there are different match patterns occurring.  

 

The solution is easy.  Find the larger length for each variable, and insert a LENGTH statement before the MERGE statement, assigning that maximum length.

 

If there are other problems with the data, they will need to be fixed.  But it is also possible that assigning the proper LENGTH will clear up everything.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 4053 views
  • 0 likes
  • 5 in conversation