Re: Why is this code adding observations to a merged dataset?

Desi2003 · Posted 11-08-2023 02:30 PM

Below is code I am working on for a project. Physical Activity and Geo_Reference_Table each have 1249 observations. When I merge them, I get no errors or warnings, but my number of observations goes up to 2009. Why would this be the case, and how do I fix it?

*Rename CT in physical activity to tract_name*;
DATA Phy_Act1;
	SET	CoImpt.PhysicalActivity;
	*RENAME	CT	= TRACT_NAME;
	RUN;
	
proc sort data=CoImpt.PhysicalActivity;
   by TRACT_NAME;
run;

proc print data=CoImpt.PhysicalActivity;
run;
	
*REFERENCE TABLE*;
*Merge physical activity with the reference table*
*Keep fips, county, tract_name, and adult_pop*;
*DATA Ref_table;
  * MERGE CoImpt.PhysicalActivity CoImpt.Geo_Reference_Table;
   *BY TRACT_NAME;
  * KEEP FIPS COUNTY TRACT_NAME adult_pop;
  * RUN;
proc sort data=CoImpt.Geo_Reference_Table;
   by TRACT_NAME;
run;

proc print data=CoImpt.Geo_Reference_Table;
run;

DATA	Ref_Table;
	MERGE	CoImpt.Geo_Reference_Table 
			CoImpt.PhysicalActivity;
	BY	TRACT_NAME;
	RUN;

PROC PRINT DATA = Ref_Table;
	RUN;

Astounding · Posted 11-08-2023 02:49 PM

Mismatches. There are many TRACT_NAME values that appear in one data set but not the other.

Do you need any help in locating matches vs. mismatches?

antonbcristina · Posted 11-08-2023 02:50 PM

If merging two datasets results in more observations than the larger of the two individual datasets, than it means that certain matches were not found. It helps to see the log as well to identify the issue. Feel free to post it and we can investigate.

Desi2003 · Posted 11-08-2023 03:25 PM

Thank you both for your responses. Here is my log. I am happy to provide additional information as needed.

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 68         
 69         %LET CourseRoot = /home/u63571562/BIOS6680_2023;
 70         LIBNAME CoImpt "&CourseRoot/CDPHE Study/Data/2_Import";
 NOTE: Libref COIMPT refers to the same physical library as _TEMP0.
 NOTE: Libref COIMPT was successfully assigned as follows: 
       Engine:        V9 
       Physical Name: /home/u63571562/BIOS6680_2023/CDPHE Study/Data/2_Import
 71         LIBNAME CoAnl "&CourseRoot/CDPHE Study/Data/4_Analysis";
 NOTE: Libref COANL was successfully assigned as follows: 
       Engine:        V9 
       Physical Name: /home/u63571562/BIOS6680_2023/CDPHE Study/Data/4_Analysis
 72         
 73         *Rename CT in physical activity to tract_name*;
 74         DATA Phy_Act1;
 75         SETCoImpt.PhysicalActivity;
 76         *RENAMECT= TRACT_NAME;
 77         RUN;
 
 NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY.
 NOTE: The data set WORK.PHY_ACT1 has 1249 observations and 6 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              1200.40k
       OS Memory           23192.00k
       Timestamp           11/08/2023 08:24:10 PM
       Step Count                        81  Switch Count  2
       Page Faults                       0
       Page Reclaims                     98
       Page Swaps                        0
       Voluntary Context Switches        19
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           520
       
 
 78         
 79         proc sort data=CoImpt.PhysicalActivity;
 80            by TRACT_NAME;
 81         run;
 
 NOTE: Input data set is already sorted, no sorting done.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              696.71k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:10 PM
       Step Count                        82  Switch Count  0
       Page Faults                       0
       Page Reclaims                     91
       Page Swaps                        0
       Voluntary Context Switches        3
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           0
       
 
 82         
 83         proc print data=CoImpt.PhysicalActivity;
 84         run;
 
 NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY.
 NOTE: PROCEDURE PRINT used (Total process time):
       real time           0.93 seconds
       user cpu time       0.94 seconds
       system cpu time     0.00 seconds
       memory              1390.62k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:11 PM
       Step Count                        83  Switch Count  0
       Page Faults                       0
       Page Reclaims                     79
       Page Swaps                        0
       Voluntary Context Switches        1
       Involuntary Context Switches      2
       Block Input Operations            0
       Block Output Operations           704
       
 
 85         
 86         *REFERENCE TABLE*;
 87         *Merge physical activity with the reference table*
 88         *Keep fips, county, tract_name, and adult_pop*;
 89         *DATA Ref_table;
 90           * MERGE CoImpt.PhysicalActivity CoImpt.Geo_Reference_Table;
 91            *BY TRACT_NAME;
 92           * KEEP FIPS COUNTY TRACT_NAME adult_pop;
 93           * RUN;
 94         proc sort data=CoImpt.Geo_Reference_Table;
 95            by TRACT_NAME;
 96         run;
 
 NOTE: Input data set is already sorted, no sorting done.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              537.81k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:11 PM
       Step Count                        84  Switch Count  0
       Page Faults                       0
       Page Reclaims                     50
       Page Swaps                        0
       Voluntary Context Switches        5
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           0
       
 
 97         
 98         proc print data=CoImpt.Geo_Reference_Table;
 99         run;
 
 NOTE: There were 1249 observations read from the data set COIMPT.GEO_REFERENCE_TABLE.
 NOTE: PROCEDURE PRINT used (Total process time):
       real time           0.58 seconds
       user cpu time       0.59 seconds
       system cpu time     0.00 seconds
       memory              648.46k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:12 PM
       Step Count                        85  Switch Count  0
       Page Faults                       0
       Page Reclaims                     66
       Page Swaps                        0
       Voluntary Context Switches        3
       Involuntary Context Switches      1
       Block Input Operations            0
       Block Output Operations           456
       
 
 100        
 101        DATARef_Table;
 102        MERGECoImpt.Geo_Reference_Table
 103        CoImpt.PhysicalActivity;
 104        BYTRACT_NAME;
 105        RUN;
 
 NOTE: There were 1249 observations read from the data set COIMPT.GEO_REFERENCE_TABLE.
 NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY.
 NOTE: The data set WORK.REF_TABLE has 2009 observations and 8 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.01 seconds
       system cpu time     0.00 seconds
       memory              1658.62k
       OS Memory           23708.00k
       Timestamp           11/08/2023 08:24:12 PM
       Step Count                        86  Switch Count  2
       Page Faults                       0
       Page Reclaims                     181
       Page Swaps                        0
       Voluntary Context Switches        18
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           520
       
 
 106        
 107        PROC PRINT DATA = Ref_Table;
 108        RUN;
 
 NOTE: There were 2009 observations read from the data set WORK.REF_TABLE.
 NOTE: PROCEDURE PRINT used (Total process time):
       real time           1.96 seconds
       user cpu time       1.94 seconds
       system cpu time     0.01 seconds
       memory              848.21k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:14 PM
       Step Count                        87  Switch Count  0
       Page Faults                       0
       Page Reclaims                     65
       Page Swaps                        0
       Voluntary Context Switches        0
       Involuntary Context Switches      373
       Block Input Operations            0
       Block Output Operations           1344
       
 
 109        
 110        OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 120

Desi2003 · Posted 11-08-2023 03:40 PM

A follow-up: I have just realized that TRACT_NAME in Geo_Reference_Table is 49 in length and PhysicalActivity is 48 in length (they are both character variables). Could this be the problem? How do I address this?

antonbcristina · Posted 11-08-2023 03:53 PM

Idealy you'd want them to be the same length. However, it doesn't seem to be the case for your datasets otherwise you should see the following in the log:

WARNING: Multiple lengths were specified for the BY variable <variable name> by input data sets. This might cause 
          unexpected results.

Desi2003 · Posted 11-08-2023 03:54 PM

It does not give me a warning, it is just adding those lines. The code runs with no errors or warnings at this point, but I should only have 1249 observations.

antonbcristina · Posted 11-08-2023 04:04 PM

The reason your merge results in more observations is because certain matches were not found. You'll want to identify which observations were not found by using a version of the code below.

Create indicator variables (call them a and b) to identify observations coming from the two datasets. You can use an IF statement to filter observations that belong to one dataset (a=1) but not the other (b=0).

data want;
	merge one(in=a) two(in=b);
	by id;
	if a=1 and b=0; /* Observations in dataset one that are not present in dataset two */
run;

ballardw · Posted 11-08-2023 04:23 PM

@Desi2003 wrote:

A follow-up: I have just realized that TRACT_NAME in Geo_Reference_Table is 49 in length and PhysicalActivity is 48 in length (they are both character variables). Could this be the problem? How do I address this?

Depending on the order of the data sets on the merge statement you may have different results in such a case.

Consider the following very small examples:

data one;
  length x $ 1;
  input x y;
datalines;
a  1
b  2
;

data two;
  length x $ 2;
  input x z;
datalines;
a  11
aa 22
;

data merged;
   merge one two;
   by x;
run;

data merged2;
   merge two one;
   by x;
run;

The order of the sets with the short/long version of the variable with different lengths makes the output quite a bit different. When the short valued data set is first on the merge statement then the length of the other is shortened before comparing for the BY, resulting in the "merge" as if "a" = "aa" , see the value of the Y and Z variables. When the order is reversed then the full length of the longer used for the By and you don't get a matching Y for the AA value of X.

Desi2003 · Posted 11-08-2023 04:34 PM

I see your point. I tried this with my data, and it had the same problem, no matter which came first in the statement. Below I have attached the two files I am working with, along with some updated code:

*****************************************************************************************
*CREATING REFERENCE TABLE*;
*Rename CT in PhysicalActivity to TRACT_NAME*;
DATA Phy_Act1;
	SET	CoImpt.PhysicalActivity;
	*RENAME	CT = TRACT_NAME;
	RUN;
	
DATA Phy_Act2;
    LENGTH TRACT_NAME $49;
	SET CoImpt.PhysicalActivity;
    RUN;
	
PROC SORT DATA=Phy_Act2;
   BY TRACT_NAME;
   RUN;

PROC SORT DATA=CoImpt.Geo_Reference_Table;
   BY TRACT_NAME;
   RUN;
  
*Merge PhysicalActivity with reference table (KEEP TRACT_NAME adult_pop COUNTY FIPS)*;
DATA	Ref_Tab;
	MERGE	CoImpt.Geo_Reference_Table 
			Phy_Act2;
	BY	TRACT_NAME;
	KEEP TRACT_NAME adult_pop COUNTY FIPS;
	RUN;

PROC PRINT DATA = Ref_Tab;
	RUN;

Astounding · Posted 11-08-2023 04:33 PM

Back to the idea of mismatches. One poster has already shown you how to identify some of the mismatches. Here is a more complete program. After sorting:

DATA no_Ref_Table no_Act_Table match;
   MERGE CoImpt.Geo_Reference_Table (in=ref) 
		  CoImpt.PhysicalActivity (in=act);
   BY TRACT_NAME;
   if ref=0 then output no_Ref_Table;
   else if act=0 then output no_Act_Table;
   else output match;
run;

From that point, you just have to roll up your sleeves and look at the data. There may be some obvious patterns. For example, it may be that TRACT_NAME values are all caps in one data set, but just initial caps in the other data set. You just have to explore the output to find out.

Desi2003 · Posted 11-08-2023 04:55 PM

Okay- based on this code, it looks like the last letter of the word in the observations is being cut off ('Colorado' is coming out 'Colorad'). This is interesting, as the variables in both data sets appear correctly on their own and are both character variables with length 49. Man oh man, SAS can be confusing.

Desi2003 · Posted 11-08-2023 05:02 PM

Wait! It is being truncated based on census tract number length. For example, "Census Tract 10.01, Mesa County, Colorado" is coming out "Census Tract 10.01, Mesa County, Colorad", but "Census Tract 101.05, Teller County, Colorado" is coming out "Census Tract 101.05, Teller County, Colora"

ballardw · Posted 11-08-2023 06:05 PM

Was some of this data brought into SAS using Proc Import or using an "import" widget (that usually calls Proc Import)?

This is a symptom of varying lengths of values and unless using the proper options then the length of the first rows of data set the length of the variable. So if the longest value encountered in the first 20 or so rows then values get truncated to the length encountered in the first 20.

If you used Proc Import code to read the data then go back to that step and set GUESSINGROWS=MAX;

If you used a widget look for a setting on "number of rows to guess" or similar and set to a large number like 32000.

Or write your own data step code to read the variables with the intended length.

Desi2003 · Posted 11-08-2023 06:54 PM

Thank you for your input, but I did not use an INPUT statement. I also made sure both sets are character variable with 49 length.

Registration is open