Below is code I am working on for a project. Physical Activity and Geo_Reference_Table each have 1249 observations. When I merge them, I get no errors or warnings, but my number of observations goes up to 2009. Why would this be the case, and how do I fix it?
*Rename CT in physical activity to tract_name*; DATA Phy_Act1; SET CoImpt.PhysicalActivity; *RENAME CT = TRACT_NAME; RUN; proc sort data=CoImpt.PhysicalActivity; by TRACT_NAME; run; proc print data=CoImpt.PhysicalActivity; run; *REFERENCE TABLE*; *Merge physical activity with the reference table* *Keep fips, county, tract_name, and adult_pop*; *DATA Ref_table; * MERGE CoImpt.PhysicalActivity CoImpt.Geo_Reference_Table; *BY TRACT_NAME; * KEEP FIPS COUNTY TRACT_NAME adult_pop; * RUN; proc sort data=CoImpt.Geo_Reference_Table; by TRACT_NAME; run; proc print data=CoImpt.Geo_Reference_Table; run; DATA Ref_Table; MERGE CoImpt.Geo_Reference_Table CoImpt.PhysicalActivity; BY TRACT_NAME; RUN; PROC PRINT DATA = Ref_Table; RUN;
If merging two datasets results in more observations than the larger of the two individual datasets, than it means that certain matches were not found. It helps to see the log as well to identify the issue. Feel free to post it and we can investigate.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 68 69 %LET CourseRoot = /home/u63571562/BIOS6680_2023; 70 LIBNAME CoImpt "&CourseRoot/CDPHE Study/Data/2_Import"; NOTE: Libref COIMPT refers to the same physical library as _TEMP0. NOTE: Libref COIMPT was successfully assigned as follows: Engine: V9 Physical Name: /home/u63571562/BIOS6680_2023/CDPHE Study/Data/2_Import 71 LIBNAME CoAnl "&CourseRoot/CDPHE Study/Data/4_Analysis"; NOTE: Libref COANL was successfully assigned as follows: Engine: V9 Physical Name: /home/u63571562/BIOS6680_2023/CDPHE Study/Data/4_Analysis 72 73 *Rename CT in physical activity to tract_name*; 74 DATA Phy_Act1; 75 SETCoImpt.PhysicalActivity; 76 *RENAMECT= TRACT_NAME; 77 RUN; NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY. NOTE: The data set WORK.PHY_ACT1 has 1249 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 1200.40k OS Memory 23192.00k Timestamp 11/08/2023 08:24:10 PM Step Count 81 Switch Count 2 Page Faults 0 Page Reclaims 98 Page Swaps 0 Voluntary Context Switches 19 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 520 78 79 proc sort data=CoImpt.PhysicalActivity; 80 by TRACT_NAME; 81 run; NOTE: Input data set is already sorted, no sorting done. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 696.71k OS Memory 23188.00k Timestamp 11/08/2023 08:24:10 PM Step Count 82 Switch Count 0 Page Faults 0 Page Reclaims 91 Page Swaps 0 Voluntary Context Switches 3 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 0 82 83 proc print data=CoImpt.PhysicalActivity; 84 run; NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY. NOTE: PROCEDURE PRINT used (Total process time): real time 0.93 seconds user cpu time 0.94 seconds system cpu time 0.00 seconds memory 1390.62k OS Memory 23188.00k Timestamp 11/08/2023 08:24:11 PM Step Count 83 Switch Count 0 Page Faults 0 Page Reclaims 79 Page Swaps 0 Voluntary Context Switches 1 Involuntary Context Switches 2 Block Input Operations 0 Block Output Operations 704 85 86 *REFERENCE TABLE*; 87 *Merge physical activity with the reference table* 88 *Keep fips, county, tract_name, and adult_pop*; 89 *DATA Ref_table; 90 * MERGE CoImpt.PhysicalActivity CoImpt.Geo_Reference_Table; 91 *BY TRACT_NAME; 92 * KEEP FIPS COUNTY TRACT_NAME adult_pop; 93 * RUN; 94 proc sort data=CoImpt.Geo_Reference_Table; 95 by TRACT_NAME; 96 run; NOTE: Input data set is already sorted, no sorting done. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 537.81k OS Memory 23188.00k Timestamp 11/08/2023 08:24:11 PM Step Count 84 Switch Count 0 Page Faults 0 Page Reclaims 50 Page Swaps 0 Voluntary Context Switches 5 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 0 97 98 proc print data=CoImpt.Geo_Reference_Table; 99 run; NOTE: There were 1249 observations read from the data set COIMPT.GEO_REFERENCE_TABLE. NOTE: PROCEDURE PRINT used (Total process time): real time 0.58 seconds user cpu time 0.59 seconds system cpu time 0.00 seconds memory 648.46k OS Memory 23188.00k Timestamp 11/08/2023 08:24:12 PM Step Count 85 Switch Count 0 Page Faults 0 Page Reclaims 66 Page Swaps 0 Voluntary Context Switches 3 Involuntary Context Switches 1 Block Input Operations 0 Block Output Operations 456 100 101 DATARef_Table; 102 MERGECoImpt.Geo_Reference_Table 103 CoImpt.PhysicalActivity; 104 BYTRACT_NAME; 105 RUN; NOTE: There were 1249 observations read from the data set COIMPT.GEO_REFERENCE_TABLE. NOTE: There were 1249 observations read from the data set COIMPT.PHYSICALACTIVITY. NOTE: The data set WORK.REF_TABLE has 2009 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds user cpu time 0.01 seconds system cpu time 0.00 seconds memory 1658.62k OS Memory 23708.00k Timestamp 11/08/2023 08:24:12 PM Step Count 86 Switch Count 2 Page Faults 0 Page Reclaims 181 Page Swaps 0 Voluntary Context Switches 18 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 520 106 107 PROC PRINT DATA = Ref_Table; 108 RUN; NOTE: There were 2009 observations read from the data set WORK.REF_TABLE. NOTE: PROCEDURE PRINT used (Total process time): real time 1.96 seconds user cpu time 1.94 seconds system cpu time 0.01 seconds memory 848.21k OS Memory 23188.00k Timestamp 11/08/2023 08:24:14 PM Step Count 87 Switch Count 0 Page Faults 0 Page Reclaims 65 Page Swaps 0 Voluntary Context Switches 0 Involuntary Context Switches 373 Block Input Operations 0 Block Output Operations 1344 109 110 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 120
A follow-up: I have just realized that TRACT_NAME in Geo_Reference_Table is 49 in length and PhysicalActivity is 48 in length (they are both character variables). Could this be the problem? How do I address this?
Idealy you'd want them to be the same length. However, it doesn't seem to be the case for your datasets otherwise you should see the following in the log:
WARNING: Multiple lengths were specified for the BY variable <variable name> by input data sets. This might cause unexpected results.
It does not give me a warning, it is just adding those lines. The code runs with no errors or warnings at this point, but I should only have 1249 observations.
The reason your merge results in more observations is because certain matches were not found. You'll want to identify which observations were not found by using a version of the code below.
Create indicator variables (call them a and b) to identify observations coming from the two datasets. You can use an IF statement to filter observations that belong to one dataset (a=1) but not the other (b=0).
data want; merge one(in=a) two(in=b); by id; if a=1 and b=0; /* Observations in dataset one that are not present in dataset two */ run;
@Desi2003 wrote:
A follow-up: I have just realized that TRACT_NAME in Geo_Reference_Table is 49 in length and PhysicalActivity is 48 in length (they are both character variables). Could this be the problem? How do I address this?
Depending on the order of the data sets on the merge statement you may have different results in such a case.
Consider the following very small examples:
data one; length x $ 1; input x y; datalines; a 1 b 2 ; data two; length x $ 2; input x z; datalines; a 11 aa 22 ; data merged; merge one two; by x; run; data merged2; merge two one; by x; run;
The order of the sets with the short/long version of the variable with different lengths makes the output quite a bit different. When the short valued data set is first on the merge statement then the length of the other is shortened before comparing for the BY, resulting in the "merge" as if "a" = "aa" , see the value of the Y and Z variables. When the order is reversed then the full length of the longer used for the By and you don't get a matching Y for the AA value of X.
I see your point. I tried this with my data, and it had the same problem, no matter which came first in the statement. Below I have attached the two files I am working with, along with some updated code:
***************************************************************************************** *CREATING REFERENCE TABLE*; *Rename CT in PhysicalActivity to TRACT_NAME*; DATA Phy_Act1; SET CoImpt.PhysicalActivity; *RENAME CT = TRACT_NAME; RUN; DATA Phy_Act2; LENGTH TRACT_NAME $49; SET CoImpt.PhysicalActivity; RUN; PROC SORT DATA=Phy_Act2; BY TRACT_NAME; RUN; PROC SORT DATA=CoImpt.Geo_Reference_Table; BY TRACT_NAME; RUN; *Merge PhysicalActivity with reference table (KEEP TRACT_NAME adult_pop COUNTY FIPS)*; DATA Ref_Tab; MERGE CoImpt.Geo_Reference_Table Phy_Act2; BY TRACT_NAME; KEEP TRACT_NAME adult_pop COUNTY FIPS; RUN; PROC PRINT DATA = Ref_Tab; RUN;
Back to the idea of mismatches. One poster has already shown you how to identify some of the mismatches. Here is a more complete program. After sorting:
DATA no_Ref_Table no_Act_Table match;
MERGE CoImpt.Geo_Reference_Table (in=ref)
CoImpt.PhysicalActivity (in=act);
BY TRACT_NAME;
if ref=0 then output no_Ref_Table;
else if act=0 then output no_Act_Table;
else output match;
run;
From that point, you just have to roll up your sleeves and look at the data. There may be some obvious patterns. For example, it may be that TRACT_NAME values are all caps in one data set, but just initial caps in the other data set. You just have to explore the output to find out.
Okay- based on this code, it looks like the last letter of the word in the observations is being cut off ('Colorado' is coming out 'Colorad'). This is interesting, as the variables in both data sets appear correctly on their own and are both character variables with length 49. Man oh man, SAS can be confusing.
Wait! It is being truncated based on census tract number length. For example, "Census Tract 10.01, Mesa County, Colorado" is coming out "Census Tract 10.01, Mesa County, Colorad", but "Census Tract 101.05, Teller County, Colorado" is coming out "Census Tract 101.05, Teller County, Colora"
Was some of this data brought into SAS using Proc Import or using an "import" widget (that usually calls Proc Import)?
This is a symptom of varying lengths of values and unless using the proper options then the length of the first rows of data set the length of the variable. So if the longest value encountered in the first 20 or so rows then values get truncated to the length encountered in the first 20.
If you used Proc Import code to read the data then go back to that step and set GUESSINGROWS=MAX;
If you used a widget look for a setting on "number of rows to guess" or similar and set to a large number like 32000.
Or write your own data step code to read the variables with the intended length.
Thank you for your input, but I did not use an INPUT statement. I also made sure both sets are character variable with 49 length.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.