Solved: Re: Creating subset of data by matching observations to second data se...

Jordan88 · Posted 12-12-2016 02:38 PM

Hi All,

So I am working with two different datasets which have data coming in from multiple facilities. Both datasets report much of the same information, but have some differences (including entries that are available in one, but not the other). We are currently working on switching from using one of these datasets to the other for a number of products, and I am having to perform some QA on them.

I want to create two subsets of the data, where I only pull entries that occur in both datasets (and would also like to do the opposite, to determine which data is appearing in one but not the other and why).

Each entry has a "unique" identifier number, but unfortunately they can appear more than once because each of the facilities generates this number indipendantly of any other facility, and sometimes they end up being the same by chance. Because if this, I can't simply match based on the "unique" identifier alone.

I used the following code to match based on the identifier, but need to tell it to also match the facility WITHIN THE SAME ROW:

proc sql noprint;
	create table SET1_MATCHED as
	select *
	from ATDED.SET1
	where EPISODE_NR in (select EPISODE_NR from SET2)
	by facility
	;
	quit;

proc sql noprint;
	create table SET2_MATCHED as
	select *
	from ATDED.SET2
	where EPISODE_NR in (select EPISODE_NR from SET1)
	by facility
	;
	quit;

Unfortunately, I can't simply add "and FACILITY in(select FACILITY from SET1/2)" because it won't search within the same row, it will just check if the facility is within the other data set at all, and therefore produces identical subsets of the data to ust excluding this line.

Any suggestions?

(Note: I could approach this a little differently and merge the datasets, but would prefer to keep them separate if possible. I could also add in a "and FACILITY = facility_name" and create a subset for each facility and append the sets back together, but there are a lot of facilities and I figured there had to be a more efficient way. ).

Jordan88 · Posted 12-14-2016 06:05 PM

Just in case anyone is actually following this (probably not), here is how I solved both the matching and the non-matching data.

Matching:

proc sql noprint;
	create table SET1_MATCHED as
	select distinct a. *
	from SET1 a, SET2 b
		where a.EPISODE_NR = b.EPISODE_NR 
		and a.FACILITY = b.FACILITY
	;
quit;

(REPEATED FOR SET2)

Non-Matching: It involves 3 steps for each data set. It isn't the most elegent and there is probably a more efficient way to do it, but it works.

	/*Create Identifier Datasets*/
		proc sql noprint;
			create table SET1_IDENTIFIERS as
			select EPISODE_NR, FACILITY from SET1
			;
		quit;

(REPEAT FOR SET2)


		/*Create Identifier Datasets for NON MATCHED DATA*/
		proc sql noprint;
			create table SET1_NONMATCHED_IDENTIFIERS as
			select EPISODE_NR, FACILITY from SET1_IDENTIFIERS
			except 
			select EPISODE_NR, FACILITY from SET2_IDENTIFIERS
			;
		quit; 

(REPEAT FOR SET2)

			Proc sql noprint;
				create table SET1_NOMMATCHED as
				select *
				from SET1_NONMATCHED_IDENTIFIERS as a, SET1 as b
				where a.EPISODE_NR=b.EPISODE_NR and a.FACILITY=b.FACILITY
				;
			quit;

(REPEAT FOR SET2)

View solution in original post

Shmuel · Posted 12-12-2016 03:10 PM

Please post a sample of both datasets that demostarate the problems and use as test data.

Jordan88 · Posted 12-12-2016 03:12 PM

Unfortunately I can't, due to confidentiality issues. If necessary I could create some sample datasets with similar data, it will just take me some time.

ballardw · Posted 12-12-2016 03:29 PM

You might be looking at an Inner Join on EPISODE_NR and Facility to start.

proc sql;
   create table both as
   select a.EPISODE_NR, a.Facility
   from (select EPISODE_NR, Facility from SET1) as a
        inner join
        (select EPISODE_NR, Facility from SET2) as b
        on a.EPISODE_NR=b.EPISODE_NR and a.Facility=b.facility
  ;
quit;

This would give you the Episode_nr and Facility that match in both sets.

Then something like this would yeild the ones that are only in the first set but not the second, and in the second but not the first:

proc sql;
   create table OnlyIn1 as
   select * from(select EPISODE_NR, Facility from SET1)
   except
   select * from both;
   create table OnlyIn2 as
   select * from(select EPISODE_NR, Facility from SET2)
   except
   select * from both;
quit;

If your data contains duplicates of EPISODE_NR for some facilities then you may want to consider adding DISTINCT to most of those selects.

You would now have 3 non-overlapping sets of your identification combinations and you could extract the specific sets as needed (hopefully) using Left or Inner Joins.

Jordan88 · Posted 12-12-2016 03:38 PM

Thanks for the response! I'll give this a shot.

There are some duplicate EPISODE_NR 's within the sets (when wrong information was entered the first time). Will this cause issues? We don't necessarily want to filter them out and I can filter out duplicates later in Tableau using a another variable.

ballardw · Posted 12-12-2016 03:45 PM

The code just pulls identifiers. What I was thinking about was if the Episode_nr occurend many times, like account_number within a bank. There would be the potential to generate lots of duplicates for each match, which is redundant information.

Once you determine that episode XX23 occurs in facility ABC from just data set 1 then you can use that infomation to select ALL the records from data set 1 that have the combination XX23 and ABC.

rogerjdeangelis · Posted 12-12-2016 06:15 PM

This does not yield the records you want but
might be useful for debugging. You may be able to alter it to get records.

Venn diagram analysis of the non unique identifiers in two datasets
Up to 5 dimensional ven diagrams are availiable along with Ian Whitlocks
sets of overlaps algorithm.
May only work with relatively small datasets (less that a million identifiers)

inspired by
https://goo.gl/XC0SKq
https://communities.sas.com/t5/SAS-Data-Management/Creating-subset-of-data-by-matching-observations-to-second-data/m-p/318359


HAVE two datasets with a common but non unique identifiers
======================================================

Up to 40 obs WORK.SET1 total obs=84,151

         EPISODE_
  Obs       NR

    1     250000
    2     250003
    3     250006
    4     250009

    5     250012
    6     250012
....


Up to 40 obs WORK.SET2 total obs=200,818

          EPISODE_
   Obs       NR

     1      1000

     2      1005
     3      1005

     4      1015
     5      1020
     6      1025
     7      1030
     8      1035
....

WANT the venn diagram showing number of


  Set A is                                   Element=EPISODE_NR
  Set B is                                   Element=EPISODE_NR

                   A Union B
                   284,969  Total
                   266,468  Distincts

              SET1           SET2
              84,151        200,818   Totals
              83,334        199,801   Distincts

             *8888888888*     *88888888*
            8            8   8          8
           8              8 8            8
          8                8              8
         8               8  8              8
        8              8     8              8
       8              8       8              8
      8              8         8              8
     8              8           8              8
    8              8             8              8
    8   Distincts  8   Distincts 8  Distincts   8
    8   66,667     8   16,667    8  183,134     8
    8              8             8              8
    8              8             8              8
    8              8             8              8
     8              8           8              8
      8              8         8              8
       8              8       8              8
        8              8     8              8
         8              8   8              8
          8              8 8              8
           8              8              8
            8            8 8            8
             *8888888888*   *8888888888*


%macro utl_ven2
 (
   utitle=Venn diagram for two tables,
   uinmem1=work.EpoNspMqs, /* first  input SAS table */
   uinmem2=work.EpoNspM_f,   /* second input SAS table */
   uvar1=qprikey,   /* field names do not have to be the same */
   uvar2=mprikey
 ) / des="VENN diagram two datasets";

 %put %sysfunc(ifc(%sysevalf(%superq(uinmem1)=,boolean),**** Please Provide 1st SAS dataset ****,));
 %put %sysfunc(ifc(%sysevalf(%superq(uinmem2)=,boolean),**** Please Provide 2nd SAS dataset ****,));
 %put %sysfunc(ifc(%sysevalf(%superq(uvar1)=,boolean),**** Please Provide 1st Keys ****,));
 %put %sysfunc(ifc(%sysevalf(%superq(uvar2)=,boolean),**** Please Provide 2nd Keys ****,));
 %if %eval(
              %sysfunc(ifc(%sysevalf(%superq(uinmem1)=,boolean),1,0)) +
              %sysfunc(ifc(%sysevalf(%superq(uinmem2)=,boolean),1,0)) +
              %sysfunc(ifc(%sysevalf(%superq(uvar1)=,boolean),1,0)) +
              %sysfunc(ifc(%sysevalf(%superq(uvar2)=,boolean),1,0))
              ) eq 0 %then %do;

   /* for testing without macro
     data class1;set sashelp.class;if mod(_n_,2)=0; if _n_=11 then name='Joyce';run;
     data class2;set sashelp.class;if mod(_n_,3)=0; if _n_=1  then name='Major';run;
     %let uinmem1=class1;
     %let uinmem2=class2;
     %let uvar1=name;
     %let uvar2=name;
   */

     %let utl_lvlonea   =%upcase(%sysfunc(ifc(%index(&uinmem1,%str(.))=0,WORK,%upcase(%scan(&uinmem1,1,%str(.))))));
     %let utl_lvltwoa   =%upcase(%sysfunc(ifc(%index(&uinmem1,%str(.))=0,&uinmem1,%upcase(%scan(&uinmem1,2,%str(.))))));
     %let utl_lvlonetwoa= &utl_lvlone..&utl_lvltwo;

     %let utl_lvloneb   =%upcase(%sysfunc(ifc(%index(&uinmem2,%str(.))=0,WORK,%upcase(%scan(&uinmem2,1,%str(.))))));
     %let utl_lvltwob   =%upcase(%sysfunc(ifc(%index(&uinmem2,%str(.))=0,&uinmem2,%upcase(%scan(&uinmem2,2,%str(.))))));
     %let utl_lvlonetwob= &utl_lvlone..&utl_lvltwo;

     %let uinmema = %upcase( &utl_lvltwoa);
     %let uinmemb = %upcase( &utl_lvltwob);
     %let uvara   = %upcase( &uvar1);
     %let uvarb   = %upcase( &uvar2);

   /*----------------------------------------------*\
   |  SQL code to get counts                        |
   |   udsta    = distinct values is set a          |
   |   udstb    = distinct values is set b          |
   |                                                |
   |   unota    = values not in a                   |
   |   unotb    = values not in b                   |
   |                                                |
   |   uaib     = values in intersection            |
   \*----------------------------------------------*/

   proc sql noprint;
     select memlabel , nobs
      into :ulaba , :uina
      from dictionary.tables
      where libname="&utl_lvlonea" and memname="&uinmema";
     select memlabel , nobs
      into :ulabb , :uinb
      from dictionary.tables
      where libname="&utl_lvloneb" and memname="&uinmemb";
    select count(distinct &uvara) into :udsta
      from &uinmema;
    select count(distinct &uvarb) into :udstb
      from &uinmemb;
    select count(distinct &uvara) into :unotb
      from &uinmema
      where &uvara not in
      (select &uvarb as &uvara
       from &uinmemb);
    select count(distinct &uvarb) into :unota
      from &uinmemb
      where &uvarb not in
      (select &uvara as &uvarb
       from &uinmema);
    select count(distinct &uvara) into :uaib
      from &uinmema
      where &uvara in
      (select &uvarb as &uvara
       from &uinmemb);
   quit;
   run;

   %let udstab=%eval(&unota + &unotb + &uaib);   /* total distinct*/
   %let uaub=%eval(&uina + &uinb);               /* a union b     */
   %put unota=&unota;                            /* not in  a     */
   %put unotb=&unotb;                            /* not in  b     */
   %put uaib=&uaib;                              /* a intersect b */
   %put uina=&uina;                              /* in a          */
   %put uinb=&uinb;                              /* in b          */

   data _null_;

    file print;

        uaub=%eval(&uina + &uinb);
        unota=&unota;
        unotb=&unotb;
        udsta=&udsta;
        udstb=&udstb;
        udstab=&udstab;
        uaib=&uaib;
        uina=&uina;
        uinb=&uinb;

    put #02 @10 "Set A is %trim(%left(&ulaba))" @53 "Element=%trim(%left(&uvara))";
    put #03 @10 "Set B is %trim(%left(&ulabb))" @53 "Element=%trim(%left(&uvarb))";
    put #05 @26 " A Union B "@;
    put #06 @22 uaub comma12. @;
    put #07 @22 udstab comma12.@;
    put #08 @20 "&uinmema" @35 "&uinmemb"@;
    put #09 @16 uina  comma12. @31 uinb  comma12.;
    put #10 @16 udsta comma12. @31 udstb comma12.;
    put #12 @3 "                  *8888888888*     *88888888*             "@;
    put #13 @3 "                 8            8   8          8            "@;
    put #14 @3 "                8              8 8            8           "@;
    put #15 @3 "               8                8              8          "@;
    put #16 @3 "              8               8  8              8         "@;
    put #17 @3 "             8              8     8              8        "@;
    put #18 @3 "            8              8       8              8       "@;
    put #19 @3 "           8              8         8              8      "@;
    put #20 @3 "          8              8           8              8     "@;
    put #21 @3 "         8              8             8              8    "@;
    put #22 @3 "         8              8             8              8    "@;
    put #23 @3 "         8              8             8              8    "@;
    put #23 @10 unotb comma12.  @22 uaib comma12. @39 unota comma12. @;
    put #23 @3 "         8"@;
    put #24 @3 "         8              8             8              8    "@;
    put #25 @3 "         8              8             8              8    "@;
    put #26 @3 "         8              8             8              8    "@;
    put #27 @3 "          8              8           8              8     "@;
    put #28 @3 "           8              8         8              8      "@;
    put #29 @3 "            8              8       8              8       "@;
    put #30 @3 "             8              8     8              8        "@;
    put #31 @3 "              8              8   8              8         "@;
    put #32 @3 "               8              8 8              8          "@;
    put #33 @3 "                8              8              8           "@;
    put #34 @3 "                 8            8 8            8            "@;
    put #35 @3 "                  *8888888888*   *8888888888*             "@;
    put #36 @3 "                                                          "@;
   stop;
   run;
 %end;
%mend utl_ven2;

data set1;
  do episode_nr = 250000 to 500000 by 3;
    if uniform(-1)<.01 then output;  /*inject an dup error */
    output;
  end;
run;quit;

data set2;
  do episode_nr = 1000 to 1000000 by 5;
    if uniform(-1)<.005 then output;  /*inject an dup error */
    output;
  end;
run;quit;


%utl_ven2
 (
   uinmem1=set1,
   uinmem2=set2,
   uvar1=episode_nr,
   uvar2=episode_nr
 );

Jordan88 · Posted 12-13-2016 05:32 PM

Alright, so unfortunately @ballardw this code didn't work out quite as I think you or I had expected it to.

It ended up producing a dataset with a little over 33 times as many entries as either of the two datasets from which it was being drawn (over 100 million entries). As you can imagine, the vast majority of these were duplicates.

I tried running a Proc SQL command to remove duplicates, but the set is so large it failed because the SAS work directory ran out of space (meaning the temporary file size got above 30GB).

I created a subset of the data to continue playing around with this until it is figured out.

Currently running the following code (plus the same thing with the data sets reversed):

proc sql noprint;
create table MATCHED1 as select a. *
from SET1 a, SET2 b
where a.EPISODE_NR = b.EPISODE_NR
and a.FACILITY = b.FACILITY
;
quit;

But it's the end of the day so I'll have to wait to find out if it worked when I get back in tomorrow morning.

ballardw · Posted 12-14-2016 11:25 AM

Post some small examples of the data, or something that's similar and what you expect.

Note that I did mention that you may want to use DISTINCT in the selects to identify unique combinations of values. The muliplication of records indicates multiple matches..

Jordan88 · Posted 12-14-2016 01:37 PM

Alright, so I made up two very simplified dummy data sets and attached them here.

I tested the code that I posted yesterday (shown again below), and it works (the matched dataset attached was made using this code). I ended up having to cancel it on the full dataset because it was still running after 18 hours and I had to free up system resources to run other code. But at least I now know that it works.

proc sql noprint;
	create table DUMMY_MATCHED as
	select distinct a. *
	from DUMMY_SET_1 a, DUMMY_SET_2 b
		where a.EPISODE_NR = b.EPISODE_NR 
		and a.FACILITY = b.FACILITY
	;
quit;

Now I just need to figure out how to get non-matching entries (one data set for each input set).

I thought the following code might work, but nope .It pulls most but not all of the non-matching entries, and it also pulls in several matching entries and I can't figure out why. I have attached the resulting non-matched data sets produced by this code.

proc sql noprint;
	create table DUMMY_NON_MATCHED as
	select DISTINCT a. *
	from DUMMY_SET_1 a, DUMMY_SET_2 b
		where (a.EPISODE_NR = b.EPISODE_NR 
		and a.FACILITY NE b.FACILITY) 
                or a.EPISODE_NR NOT IN (select EPISODE_NR from DUMMY_SET_2)
	;
quit;

Jordan88 · Posted 12-14-2016 04:22 PM

I tried another method to find entries that occur in one data set but not the other, and it appears to function exactly as I need it to with the dummy dataset. Unfortunately, it doesn't when I use the full datasets because the placement, name, type, and number of variables differ, so it finds no matches and once again just creates a new dataset containing all the same entries as the original (or fails entirely).

		proc sql noprint;
			create table DUMMY_NON_MATCHED_EXC as
			select * from DUMMY_SET_1
			except 
			select * from DUMMY_SET_2
			;
		quit;

Jordan88 · Posted 12-14-2016 06:05 PM

Just in case anyone is actually following this (probably not), here is how I solved both the matching and the non-matching data.

Matching:

proc sql noprint;
	create table SET1_MATCHED as
	select distinct a. *
	from SET1 a, SET2 b
		where a.EPISODE_NR = b.EPISODE_NR 
		and a.FACILITY = b.FACILITY
	;
quit;

(REPEATED FOR SET2)

Non-Matching: It involves 3 steps for each data set. It isn't the most elegent and there is probably a more efficient way to do it, but it works.

	/*Create Identifier Datasets*/
		proc sql noprint;
			create table SET1_IDENTIFIERS as
			select EPISODE_NR, FACILITY from SET1
			;
		quit;

(REPEAT FOR SET2)


		/*Create Identifier Datasets for NON MATCHED DATA*/
		proc sql noprint;
			create table SET1_NONMATCHED_IDENTIFIERS as
			select EPISODE_NR, FACILITY from SET1_IDENTIFIERS
			except 
			select EPISODE_NR, FACILITY from SET2_IDENTIFIERS
			;
		quit; 

(REPEAT FOR SET2)

			Proc sql noprint;
				create table SET1_NOMMATCHED as
				select *
				from SET1_NONMATCHED_IDENTIFIERS as a, SET1 as b
				where a.EPISODE_NR=b.EPISODE_NR and a.FACILITY=b.FACILITY
				;
			quit;

(REPEAT FOR SET2)

Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

Re: Creating subset of data by matching observations to second data set using multiple variables

The 2025 SAS Hackathon has begun!