BookmarkSubscribeRSS Feed
Desi2003
Calcite | Level 5

Below is code I am working on for a project. Physical Activity and Geo_Reference_Table each have 1249 observations. When I merge them, I get no errors or warnings, but my number of observations goes up to 2009. Why would this be the case, and how do I fix it? 

 

*Rename CT in physical activity to tract_name*;
DATA Phy_Act1;
	SET	CoImpt.PhysicalActivity;
	*RENAME	CT	= TRACT_NAME;
	RUN;
	
proc sort data=CoImpt.PhysicalActivity;
   by TRACT_NAME;
run;

proc print data=CoImpt.PhysicalActivity;
run;
	
*REFERENCE TABLE*;
*Merge physical activity with the reference table*
*Keep fips, county, tract_name, and adult_pop*;
*DATA Ref_table;
  * MERGE CoImpt.PhysicalActivity CoImpt.Geo_Reference_Table;
   *BY TRACT_NAME;
  * KEEP FIPS COUNTY TRACT_NAME adult_pop;
  * RUN;
proc sort data=CoImpt.Geo_Reference_Table;
   by TRACT_NAME;
run;

proc print data=CoImpt.Geo_Reference_Table;
run;

DATA	Ref_Table;
	MERGE	CoImpt.Geo_Reference_Table 
			CoImpt.PhysicalActivity;
	BY	TRACT_NAME;
	RUN;

PROC PRINT DATA = Ref_Table;
	RUN;
15 REPLIES 15
Astounding
PROC Star
Mismatches. There are many TRACT_NAME values that appear in one data set but not the other.

Do you need any help in locating matches vs. mismatches?
antonbcristina
SAS Employee

If merging two datasets results in more observations than the larger of the two individual datasets, than it means that certain matches were not found. It helps to see the log as well to identify the issue. Feel free to post it and we can investigate. 

Desi2003
Calcite | Level 5

 

Thank you both for your responses. Here is my log. I am happy to provide additional information as needed. 
1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 68         
 69         %LET CourseRoot = /home/u63571562/BIOS6680_2023;
 70         LIBNAME CoImpt "&CourseRoot/CDPHE Study/Data/2_Import";
 NOTE: Libref COIMPT refers to the same physical library as _TEMP0.
 NOTE: Libref COIMPT was successfully assigned as follows: 
       Engine:        V9 
       Physical Name: /home/u63571562/BIOS6680_2023/CDPHE Study/Data/2_Import
 71         LIBNAME CoAnl "&CourseRoot/CDPHE Study/Data/4_Analysis";
 NOTE: Libref COANL was successfully assigned as follows: 
       Engine:        V9 
       Physical Name: /home/u63571562/BIOS6680_2023/CDPHE Study/Data/4_Analysis
 72         
 73         *Rename CT in physical activity to tract_name*;
 74         DATA Phy_Act1;
 75         SETCoImpt.PhysicalActivity;
 76         *RENAMECT= TRACT_NAME;
 77         RUN;
 
 NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY.
 NOTE: The data set WORK.PHY_ACT1 has 1249 observations and 6 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              1200.40k
       OS Memory           23192.00k
       Timestamp           11/08/2023 08:24:10 PM
       Step Count                        81  Switch Count  2
       Page Faults                       0
       Page Reclaims                     98
       Page Swaps                        0
       Voluntary Context Switches        19
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           520
       
 
 78         
 79         proc sort data=CoImpt.PhysicalActivity;
 80            by TRACT_NAME;
 81         run;
 
 NOTE: Input data set is already sorted, no sorting done.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              696.71k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:10 PM
       Step Count                        82  Switch Count  0
       Page Faults                       0
       Page Reclaims                     91
       Page Swaps                        0
       Voluntary Context Switches        3
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           0
       
 
 82         
 83         proc print data=CoImpt.PhysicalActivity;
 84         run;
 
 NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY.
 NOTE: PROCEDURE PRINT used (Total process time):
       real time           0.93 seconds
       user cpu time       0.94 seconds
       system cpu time     0.00 seconds
       memory              1390.62k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:11 PM
       Step Count                        83  Switch Count  0
       Page Faults                       0
       Page Reclaims                     79
       Page Swaps                        0
       Voluntary Context Switches        1
       Involuntary Context Switches      2
       Block Input Operations            0
       Block Output Operations           704
       
 
 85         
 86         *REFERENCE TABLE*;
 87         *Merge physical activity with the reference table*
 88         *Keep fips, county, tract_name, and adult_pop*;
 89         *DATA Ref_table;
 90           * MERGE CoImpt.PhysicalActivity CoImpt.Geo_Reference_Table;
 91            *BY TRACT_NAME;
 92           * KEEP FIPS COUNTY TRACT_NAME adult_pop;
 93           * RUN;
 94         proc sort data=CoImpt.Geo_Reference_Table;
 95            by TRACT_NAME;
 96         run;
 
 NOTE: Input data set is already sorted, no sorting done.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              537.81k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:11 PM
       Step Count                        84  Switch Count  0
       Page Faults                       0
       Page Reclaims                     50
       Page Swaps                        0
       Voluntary Context Switches        5
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           0
       
 
 97         
 98         proc print data=CoImpt.Geo_Reference_Table;
 99         run;
 
 NOTE: There were 1249 observations read from the data set COIMPT.GEO_REFERENCE_TABLE.
 NOTE: PROCEDURE PRINT used (Total process time):
       real time           0.58 seconds
       user cpu time       0.59 seconds
       system cpu time     0.00 seconds
       memory              648.46k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:12 PM
       Step Count                        85  Switch Count  0
       Page Faults                       0
       Page Reclaims                     66
       Page Swaps                        0
       Voluntary Context Switches        3
       Involuntary Context Switches      1
       Block Input Operations            0
       Block Output Operations           456
       
 
 100        
 101        DATARef_Table;
 102        MERGECoImpt.Geo_Reference_Table
 103        CoImpt.PhysicalActivity;
 104        BYTRACT_NAME;
 105        RUN;
 
 NOTE: There were 1249 observations read from the data set COIMPT.GEO_REFERENCE_TABLE.
 NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY.
 NOTE: The data set WORK.REF_TABLE has 2009 observations and 8 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.01 seconds
       system cpu time     0.00 seconds
       memory              1658.62k
       OS Memory           23708.00k
       Timestamp           11/08/2023 08:24:12 PM
       Step Count                        86  Switch Count  2
       Page Faults                       0
       Page Reclaims                     181
       Page Swaps                        0
       Voluntary Context Switches        18
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           520
       
 
 106        
 107        PROC PRINT DATA = Ref_Table;
 108        RUN;
 
 NOTE: There were 2009 observations read from the data set WORK.REF_TABLE.
 NOTE: PROCEDURE PRINT used (Total process time):
       real time           1.96 seconds
       user cpu time       1.94 seconds
       system cpu time     0.01 seconds
       memory              848.21k
       OS Memory           23188.00k
       Timestamp           11/08/2023 08:24:14 PM
       Step Count                        87  Switch Count  0
       Page Faults                       0
       Page Reclaims                     65
       Page Swaps                        0
       Voluntary Context Switches        0
       Involuntary Context Switches      373
       Block Input Operations            0
       Block Output Operations           1344
       
 
 109        
 110        OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 120    
Desi2003
Calcite | Level 5

A follow-up: I have just realized that TRACT_NAME in Geo_Reference_Table is 49 in length and PhysicalActivity is 48 in length (they are both character variables). Could this be the problem? How do I address this?

antonbcristina
SAS Employee

Idealy you'd want them to be the same length. However, it doesn't seem to be the case for your datasets otherwise you should see the following in the log:

 

WARNING: Multiple lengths were specified for the BY variable <variable name> by input data sets. This might cause 
          unexpected results.
Desi2003
Calcite | Level 5

It does not give me a warning, it is just adding those lines. The code runs with no errors or warnings at this point, but I should only have 1249 observations.

antonbcristina
SAS Employee

The reason your merge results in more observations is because certain matches were not found. You'll want to identify which observations were not found by using a version of the code below.

 

Create indicator variables (call them a and b) to identify observations coming from the two datasets. You can use an IF statement to filter observations that belong to one dataset (a=1) but not the other (b=0).  

 

data want;
	merge one(in=a) two(in=b);
	by id;
	if a=1 and b=0; /* Observations in dataset one that are not present in dataset two */
run;	

 

ballardw
Super User

@Desi2003 wrote:

A follow-up: I have just realized that TRACT_NAME in Geo_Reference_Table is 49 in length and PhysicalActivity is 48 in length (they are both character variables). Could this be the problem? How do I address this?


Depending on the order of the data sets on the merge statement you may have different results in such a case.

Consider the following very small examples:

data one;
  length x $ 1;
  input x y;
datalines;
a  1
b  2
;

data two;
  length x $ 2;
  input x z;
datalines;
a  11
aa 22
;

data merged;
   merge one two;
   by x;
run;

data merged2;
   merge two one;
   by x;
run;

The order of the sets with the short/long version of the variable with different lengths makes the output quite a bit different. When the short valued data set is first on the merge statement then the length of the other is shortened before comparing for the BY, resulting in the "merge" as if "a" = "aa" , see the value of the Y and Z variables. When the order is reversed then the full length of the longer used for the By and you don't get a matching Y for the AA value of X.

 

 

Desi2003
Calcite | Level 5

I see your point. I tried this with my data, and it had the same problem, no matter which came first in the statement. Below I have attached the two files I am working with, along with some updated code:

*****************************************************************************************
*CREATING REFERENCE TABLE*;
*Rename CT in PhysicalActivity to TRACT_NAME*;
DATA Phy_Act1;
	SET	CoImpt.PhysicalActivity;
	*RENAME	CT = TRACT_NAME;
	RUN;
	
DATA Phy_Act2;
    LENGTH TRACT_NAME $49;
	SET CoImpt.PhysicalActivity;
    RUN;
	
PROC SORT DATA=Phy_Act2;
   BY TRACT_NAME;
   RUN;

PROC SORT DATA=CoImpt.Geo_Reference_Table;
   BY TRACT_NAME;
   RUN;
  
*Merge PhysicalActivity with reference table (KEEP TRACT_NAME adult_pop COUNTY FIPS)*;
DATA	Ref_Tab;
	MERGE	CoImpt.Geo_Reference_Table 
			Phy_Act2;
	BY	TRACT_NAME;
	KEEP TRACT_NAME adult_pop COUNTY FIPS;
	RUN;

PROC PRINT DATA = Ref_Tab;
	RUN;
Astounding
PROC Star

Back to the idea of mismatches.  One poster has already shown you how to identify some of the mismatches.  Here is a more complete program.  After sorting:

DATA no_Ref_Table no_Act_Table match;
   MERGE CoImpt.Geo_Reference_Table (in=ref) 
		  CoImpt.PhysicalActivity (in=act);
   BY TRACT_NAME;
   if ref=0 then output no_Ref_Table;
   else if act=0 then output no_Act_Table;
   else output match;
run;

From that point, you just have to roll up your sleeves and look at the data.  There may be some obvious patterns.  For example, it may be that TRACT_NAME values are all caps in one data set, but just initial caps in the other data set.  You just have to explore the output to find out.

Desi2003
Calcite | Level 5

Okay- based on this code, it looks like the last letter of the word in the observations is being cut off ('Colorado' is coming out 'Colorad'). This is interesting, as the variables in both data sets appear correctly on their own and are both character variables with length 49. Man oh man, SAS can be confusing. 

Desi2003
Calcite | Level 5

Wait! It is being truncated based on census tract number length. For example, "Census Tract 10.01, Mesa County, Colorado" is coming out "Census Tract 10.01, Mesa County, Colorad", but "Census Tract 101.05, Teller County, Colorado" is coming out "Census Tract 101.05, Teller County, Colora"

ballardw
Super User

Was some of this data brought into SAS using Proc Import or using an "import" widget (that usually calls Proc Import)?

 

This is a symptom of varying lengths of values and unless using the proper options then the length of the first rows of data set the length of the variable. So if the longest value encountered in the first 20 or so rows then values get truncated to the length encountered in the first 20.

 

If you used Proc Import code to read the data then go back to that step and set GUESSINGROWS=MAX;

If you used a widget look for a setting on "number of rows to guess" or similar and set to a large number like 32000.

Or write your own data step code to read the variables with the intended length.

Desi2003
Calcite | Level 5

Thank you for your input, but I did not use an INPUT statement. I also made sure both sets are character variable with 49 length. 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 3094 views
  • 0 likes
  • 4 in conversation