Solved: Re: Standardize length of keys when merging multiple datasets

Cruise · Posted 05-04-2020 10:42 PM

Hi Folks:

I'm merging more than twenty survey datasets together with different sources. Is there any way to standardize the length of ID1NAME & IDNAME across all datasets efficiently? Specifying the length of variables for each datasets one by one would be tedious.

Any hints are appreciated.

data alldata_id; /*259*/ 
merge m.reference_columns
Apt_rent_id Car_per_pp_id 
Hh_type_id N_houses_id N_manufact_id 
N_service_comp_id N_students_per_class_id  
N_welfare_facility_id Ratio_health_welfare_id 
Urbanicity_id Workers_per_pp_id Disability_id
Gdp_id Insured_pop_id N_employees_id 
Retail_workers_id Vac_rate_id;
by id1name idname;
run;

But i got the error below:

WARNING: Multiple lengths were specified for the BY variable ID1NAME by input data sets. This
         might cause unexpected results.
WARNING: Multiple lengths were specified for the BY variable IDNAME by input data sets. This
         might cause unexpected results.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 250 observations read from the data set M.REFERENCE_COLUMNS.
NOTE: There were 172 observations read from the data set WORK.APT_RENT_ID.
NOTE: There were 245 observations read from the data set WORK.CAR_PER_PP_ID.
NOTE: There were 245 observations read from the data set WORK.HH_TYPE_ID.
NOTE: There were 320 observations read from the data set WORK.N_HOUSES_ID.
NOTE: There were 245 observations read from the data set WORK.N_MANUFACT_ID.
NOTE: There were 245 observations read from the data set WORK.N_SERVICE_COMP_ID.
NOTE: There were 245 observations read from the data set WORK.N_STUDENTS_PER_CLASS_ID.
NOTE: There were 245 observations read from the data set WORK.N_WELFARE_FACILITY_ID.
NOTE: There were 245 observations read from the data set WORK.RATIO_HEALTH_WELFARE_ID.
NOTE: There were 244 observations read from the data set WORK.URBANICITY_ID.
NOTE: There were 245 observations read from the data set WORK.WORKERS_PER_PP_ID.
NOTE: There were 249 observations read from the data set WORK.DISABILITY_ID.
NOTE: There were 228 observations read from the data set WORK.GDP_ID.
NOTE: There were 250 observations read from the data set WORK.INSURED_POP_ID.
NOTE: There were 228 observations read from the data set WORK.N_EMPLOYEES_ID.
NOTE: There were 228 observations read from the data set WORK.RETAIL_WORKERS_ID.
NOTE: There were 264 observations read from the data set WORK.VAC_RATE_ID.
NOTE: The data set WORK.ALLDATA_ID has 383 observations and 39 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.03 seconds

Kurt_Bremser · Posted 05-05-2020 09:47 AM

Step 1: get rid of Excel files for reading data into SAS. Instead convert the files to CSV or another textual format first. Tools like LibreOffice allow batch-conversion from, say, XLSX to CSV. All this, of course, if your data originates from Excel in the first place, which is unlikely.

Step 2: read those files with tailored data steps where you have full control over variable attributes.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

ChrisNZ · Posted 05-04-2020 11:11 PM

> Is there any way to standardize the length of ID1NAME & IDNAME ... would be tedious

Short answer: No there's no non-tedious way.

You have a data quality problem and the source data should be painstakingly standardised by hand.

There is a lazy way out though: Add something like

length id1name idname $20;

at the top of data step.

This does not replace vetting and cleaning the source data.

High-Performance SAS Coding - Third Edition

SASKiwi · Posted 05-05-2020 12:17 AM

@Cruise - You should also take note of this:

NOTE: MERGE statement has more than one data set with repeats of BY values.
That means more than one of your input datasets has duplicate BY variable values. The MERGE statement cannot handle this, so the duplicate rows on your second and subsequent datasets will be dropped. Is this what you want?

Cruise · Posted 05-05-2020 08:54 AM

Thank you so much for the pointer. This means I have to investigate the source of duplicates. Is there a way to know which particular datasets had duplicates that needs my investigation in the data merge process? How did you know that there were duplicates? These are invaluable info to me.

Kurt_Bremser · Posted 05-05-2020 10:07 AM

The many-to-many might be caused by the length issue. Suppose something has an ID code of 12345X, and something else (in a later run) 12345XY. Since after the first import the code will be limited to a length of 6, the Y will be cut off when merging. Like

data have1;
input ID $5.;
datalines;
12345X
12345X
;

data have2;
input ID $6.;
datalines;
12345X
12345XY
;

data want;
merge
  have1
  have2
;
by id;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 05-05-2020 02:10 AM

THE way for treating this is to fix your import process(es). Variable attributes must be standardized along their documented values when the data is read into SAS. Do NOT use PROC IMPORT and/or Excel files.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Cruise · Posted 05-05-2020 08:55 AM

This is what happened. I used PROC IMPORT to import excel variables. Because I couldn't figure out the other way around.

Kurt_Bremser · Posted 05-05-2020 09:47 AM

Step 1: get rid of Excel files for reading data into SAS. Instead convert the files to CSV or another textual format first. Tools like LibreOffice allow batch-conversion from, say, XLSX to CSV. All this, of course, if your data originates from Excel in the first place, which is unlikely.

Step 2: read those files with tailored data steps where you have full control over variable attributes.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Registration is open

SAS Training: Just a Click Away