topic Re: Standardize length of keys when merging multiple datasets in SAS Programming

Standardize length of keys when merging multiple datasets

Cruise — Tue, 05 May 2020 02:42:53 GMT

Hi Folks:

I'm merging more than twenty survey datasets together with different sources. Is there any way to standardize the length of ID1NAME & IDNAME across all datasets efficiently? Specifying the length of variables for each datasets one by one would be tedious.

Any hints are appreciated.

data alldata_id; /*259*/ 
merge m.reference_columns
Apt_rent_id Car_per_pp_id 
Hh_type_id N_houses_id N_manufact_id 
N_service_comp_id N_students_per_class_id  
N_welfare_facility_id Ratio_health_welfare_id 
Urbanicity_id Workers_per_pp_id Disability_id
Gdp_id Insured_pop_id N_employees_id 
Retail_workers_id Vac_rate_id;
by id1name idname;
run;

But i got the error below:

WARNING: Multiple lengths were specified for the BY variable ID1NAME by input data sets. This
         might cause unexpected results.
WARNING: Multiple lengths were specified for the BY variable IDNAME by input data sets. This
         might cause unexpected results.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 250 observations read from the data set M.REFERENCE_COLUMNS.
NOTE: There were 172 observations read from the data set WORK.APT_RENT_ID.
NOTE: There were 245 observations read from the data set WORK.CAR_PER_PP_ID.
NOTE: There were 245 observations read from the data set WORK.HH_TYPE_ID.
NOTE: There were 320 observations read from the data set WORK.N_HOUSES_ID.
NOTE: There were 245 observations read from the data set WORK.N_MANUFACT_ID.
NOTE: There were 245 observations read from the data set WORK.N_SERVICE_COMP_ID.
NOTE: There were 245 observations read from the data set WORK.N_STUDENTS_PER_CLASS_ID.
NOTE: There were 245 observations read from the data set WORK.N_WELFARE_FACILITY_ID.
NOTE: There were 245 observations read from the data set WORK.RATIO_HEALTH_WELFARE_ID.
NOTE: There were 244 observations read from the data set WORK.URBANICITY_ID.
NOTE: There were 245 observations read from the data set WORK.WORKERS_PER_PP_ID.
NOTE: There were 249 observations read from the data set WORK.DISABILITY_ID.
NOTE: There were 228 observations read from the data set WORK.GDP_ID.
NOTE: There were 250 observations read from the data set WORK.INSURED_POP_ID.
NOTE: There were 228 observations read from the data set WORK.N_EMPLOYEES_ID.
NOTE: There were 228 observations read from the data set WORK.RETAIL_WORKERS_ID.
NOTE: There were 264 observations read from the data set WORK.VAC_RATE_ID.
NOTE: The data set WORK.ALLDATA_ID has 383 observations and 39 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.03 seconds

Re: Standardize length of keys when merging multiple datasets

ChrisNZ — Tue, 05 May 2020 03:13:06 GMT

> Is there any way to standardize the length of ID1NAME & IDNAME ... would be tedious

Short answer: No there's no non-tedious way.

You have a data quality problem and the source data should be painstakingly standardised by hand.

There is a lazy way out though: Add something like

length id1name idname $20;

at the top of data step.

This does not replace vetting and cleaning the source data.

Re: Standardize length of keys when merging multiple datasets

SASKiwi — Tue, 05 May 2020 04:20:09 GMT

@Cruise - You should also take note of this:

NOTE: MERGE statement has more than one data set with repeats of BY values.
That means more than one of your input datasets has duplicate BY variable values. The MERGE statement cannot handle this, so the duplicate rows on your second and subsequent datasets will be dropped. Is this what you want?

Re: Standardize length of keys when merging multiple datasets

Kurt_Bremser — Tue, 05 May 2020 06:10:33 GMT

THE way for treating this is to fix your import process(es). Variable attributes must be standardized along their documented values when the data is read into SAS. Do NOT use PROC IMPORT and/or Excel files.

Re: Standardize length of keys when merging multiple datasets

Cruise — Tue, 05 May 2020 12:54:13 GMT

Thank you so much for the pointer. This means I have to investigate the source of duplicates. Is there a way to know which particular datasets had duplicates that needs my investigation in the data merge process? How did you know that there were duplicates? These are invaluable info to me.

Re: Standardize length of keys when merging multiple datasets

Cruise — Tue, 05 May 2020 12:55:57 GMT

This is what happened. I used PROC IMPORT to import excel variables. Because I couldn't figure out the other way around.

Re: Standardize length of keys when merging multiple datasets

Kurt_Bremser — Tue, 05 May 2020 13:47:23 GMT

Step 1: get rid of Excel files for reading data into SAS. Instead convert the files to CSV or another textual format first. Tools like LibreOffice allow batch-conversion from, say, XLSX to CSV. All this, of course, if your data originates from Excel in the first place, which is unlikely.

Step 2: read those files with tailored data steps where you have full control over variable attributes.

Re: Standardize length of keys when merging multiple datasets

Kurt_Bremser — Tue, 05 May 2020 14:07:55 GMT

The many-to-many might be caused by the length issue. Suppose something has an ID code of 12345X, and something else (in a later run) 12345XY. Since after the first import the code will be limited to a length of 6, the Y will be cut off when merging. Like

data have1;
input ID $5.;
datalines;
12345X
12345X
;

data have2;
input ID $6.;
datalines;
12345X
12345XY
;

data want;
merge
  have1
  have2
;
by id;
run;