Hi Folks:
I'm merging more than twenty survey datasets together with different sources. Is there any way to standardize the length of ID1NAME & IDNAME across all datasets efficiently? Specifying the length of variables for each datasets one by one would be tedious.
Any hints are appreciated.
data alldata_id; /*259*/
merge m.reference_columns
Apt_rent_id Car_per_pp_id
Hh_type_id N_houses_id N_manufact_id
N_service_comp_id N_students_per_class_id
N_welfare_facility_id Ratio_health_welfare_id
Urbanicity_id Workers_per_pp_id Disability_id
Gdp_id Insured_pop_id N_employees_id
Retail_workers_id Vac_rate_id;
by id1name idname;
run;
But i got the error below:
WARNING: Multiple lengths were specified for the BY variable ID1NAME by input data sets. This
might cause unexpected results.
WARNING: Multiple lengths were specified for the BY variable IDNAME by input data sets. This
might cause unexpected results.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 250 observations read from the data set M.REFERENCE_COLUMNS.
NOTE: There were 172 observations read from the data set WORK.APT_RENT_ID.
NOTE: There were 245 observations read from the data set WORK.CAR_PER_PP_ID.
NOTE: There were 245 observations read from the data set WORK.HH_TYPE_ID.
NOTE: There were 320 observations read from the data set WORK.N_HOUSES_ID.
NOTE: There were 245 observations read from the data set WORK.N_MANUFACT_ID.
NOTE: There were 245 observations read from the data set WORK.N_SERVICE_COMP_ID.
NOTE: There were 245 observations read from the data set WORK.N_STUDENTS_PER_CLASS_ID.
NOTE: There were 245 observations read from the data set WORK.N_WELFARE_FACILITY_ID.
NOTE: There were 245 observations read from the data set WORK.RATIO_HEALTH_WELFARE_ID.
NOTE: There were 244 observations read from the data set WORK.URBANICITY_ID.
NOTE: There were 245 observations read from the data set WORK.WORKERS_PER_PP_ID.
NOTE: There were 249 observations read from the data set WORK.DISABILITY_ID.
NOTE: There were 228 observations read from the data set WORK.GDP_ID.
NOTE: There were 250 observations read from the data set WORK.INSURED_POP_ID.
NOTE: There were 228 observations read from the data set WORK.N_EMPLOYEES_ID.
NOTE: There were 228 observations read from the data set WORK.RETAIL_WORKERS_ID.
NOTE: There were 264 observations read from the data set WORK.VAC_RATE_ID.
NOTE: The data set WORK.ALLDATA_ID has 383 observations and 39 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
Step 1: get rid of Excel files for reading data into SAS. Instead convert the files to CSV or another textual format first. Tools like LibreOffice allow batch-conversion from, say, XLSX to CSV. All this, of course, if your data originates from Excel in the first place, which is unlikely.
Step 2: read those files with tailored data steps where you have full control over variable attributes.
> Is there any way to standardize the length of ID1NAME & IDNAME ... would be tedious
Short answer: No there's no non-tedious way.
You have a data quality problem and the source data should be painstakingly standardised by hand.
There is a lazy way out though: Add something like
length id1name idname $20;
at the top of data step.
This does not replace vetting and cleaning the source data.
@Cruise - You should also take note of this:
NOTE: MERGE statement has more than one data set with repeats of BY values.
That means more than one of your input datasets has duplicate BY variable values. The MERGE statement cannot handle this, so the duplicate rows on your second and subsequent datasets will be dropped. Is this what you want?
The many-to-many might be caused by the length issue. Suppose something has an ID code of 12345X, and something else (in a later run) 12345XY. Since after the first import the code will be limited to a length of 6, the Y will be cut off when merging. Like
data have1;
input ID $5.;
datalines;
12345X
12345X
;
data have2;
input ID $6.;
datalines;
12345X
12345XY
;
data want;
merge
have1
have2
;
by id;
run;
THE way for treating this is to fix your import process(es). Variable attributes must be standardized along their documented values when the data is read into SAS. Do NOT use PROC IMPORT and/or Excel files.
Step 1: get rid of Excel files for reading data into SAS. Instead convert the files to CSV or another textual format first. Tools like LibreOffice allow batch-conversion from, say, XLSX to CSV. All this, of course, if your data originates from Excel in the first place, which is unlikely.
Step 2: read those files with tailored data steps where you have full control over variable attributes.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.