values in many datasets

Emma8 · Posted 06-17-2020 02:14 PM

PaigeMiller · Posted 06-17-2020 02:16 PM

You talk about "values" but you seem to be implying it is the variable names that is of concern. Could you please clarify?

If it is variable names, PROC CONTENTS ought to be able to show you what the variable names and then you can compare across data sets.

--
Paige Miller

Emma8 · Posted 06-17-2020 02:34 PM

Shmuel · Posted 06-17-2020 03:37 PM

I'm afraid you can't avoid hard work.

First, check which variables from main data seem to fit which variables from any other data.

Suppose you have common ID and varM of main data seem to fit var2 from data2 and Var3 from data3, then I think that next code can help you decide are the values the same or not:

/* assumed all data sorted by ID */
data to_check;
   marge dataM (keep=ID varM)
              data2  (keep=ID var2)
              data3  (keep=ID var3)
     ;
   by ID;
        flag2 = (varM = var2);  /* flag is 1 for true, 0 for false */
        flag3 = (varM = var3);
run;
proc freq data=to_check;
      table flag2 flag3 / nopercent;
run;

Emma8 · Posted 06-17-2020 03:58 PM

Shmuel · Posted 06-17-2020 08:08 PM

1) Avoid merge many to many by extending ID into several variables like -

ID_number visit_date department (or a like) - until you get unique key.

2) While proc compare is limited to compare two datasets in a time, you can compare

all datasets with all fitting couples of variables using your own MERGE code,

thus creating as many as needed flags within one run

3) Proc compare creates a long output while proc freq of the flags will show exact amount

of equal and non equal values per couple

4) Use KEEP to keep only those variables needed to be checked and

use RENAME to enable comparing and avoiding value overriding in case of

same variable names in two or more datasets.

ballardw · Posted 06-17-2020 02:31 PM

Are these data sets supposed to have the data in the same column order, which might happen when read from an external source with the same order?

Tom · Posted 06-17-2020 04:38 PM

Sounds like you need solve this problem bit by bit instead of trying to attack all in one fell swoop.

Start by explaining the issue more clearly. Use concrete examples with just a few variables and a few observations. You can make up the data from scratch or select a few representative examples from your real data.

Some questions to help figure out what you issue really is.

Do you not know how creatively they changed the names of the variables and are looking for ways to check which variable names might refer to the same type of information? For example does AGE in DATA1 contain the same type of information as AGE_AT_ONSET in DATA2?

Use PROC CONTENTS (or query DICTIONARY.COLUMNS or SASHELP.VCOLUMN) to get a list of all of the variables and check for similar variable names and types.

Do you actually know the variable names in the different datasets and are looking to compare the values? For example are you asking how to compare values of AGE in DATA1 with values of AGE_AT_ONSET in DATA2? Do you know how to match the observations between the datasets?
Use PROC COMPARE.

proc compare data=data1 compare=data2;
  id id visit ;
  var age;
  with age_at_onset;
run;

Emma8 · Posted 06-17-2020 10:27 PM

andreas_lds · Posted 06-18-2020 01:25 AM

@Emma8 wrote:
I have more than 2 datasets. How can do proc compare?

You have to write each proc compare step, because - as you have said earlier - the variable-names are different. Creating lookup-dataset would open the possibility to generate the code, but i would waste time for that only, if i had to do such things regularly.