🔒 This topic is solved and locked.

Hi All,

 

I often have a dataset in (I think) first normal form.  So say it's a long, skinny dataset of patient visits.  And the dataset has patient-level variables (sex, eyecolor) and visit-level data.  I expect the patient-level variables to be constant within a patient ID, and want to confirm that is true.  I'm not sure what feels like the "best" approach to this data cleaning exercise: one that will be reasonably efficient for largish datasets, and support multiple ID vars (complex keys) and character vars.

 

So suppose I have the data below, and want to detect that SEX is not constant within ID for ID=2:

data have;
  input id sex eyecolor visitdate score;
  cards;
1 1 1 1 10
1 1 1 2 20
1 1 1 3 30
2 1 2 1 10
2 1 2 2 20
2 2 2 3 30
;
run;

 

For a quick-and-dirty check, I will sometimes do:

proc sort data=have;
  by id;
run;

data _null_;
  set have;
  by id sex eyecolor;
  if first.eyecolor and not (first.id) then
    put "ERROR: inconsistent value " (id sex eyecolor visitdate)(=);
run;

 

That will throw an error to the log if SEX or EYECOLOR varies within ID, either from the IF statement, or first from a BY-statement "not sorted" error, since the data are sorted only by ID and not by SEX and EYECOLOR.

 

But of course if I see an error, what I really want is to look at all the records where there is a conflict for that ID.  So for that I could use a double-DOW loop, like:

 

proc sort data=have;
  by id sex eyecolor;
run;

data want;
  do until (last.id);
    set have;
    by id sex eyecolor;
    if first.eyecolor and not (first.id) then do;
      put "ERROR: inconsistent value " (id sex eyecolor visitdate)(=);
      _flag=1;
    end;
  end;

  do until (last.id);
    set have;
    by id;
    if _flag=1 then output;
  end;

  drop _flag;
run;

 

But for big datasets, that sort can get really expensive, particularly if there are lots of ID vars and lots of attribute vars I am trying to check.  And for every variable added to the BY statement, SAS has to add first.x and last.x variables to the PDV.

 

So then I was thinking of SQL approaches, and thought maybe:

 

proc sql;
  create table want as
  select b.*
  from 
    (
     select id
     from 
       (
        select distinct id, sex, eyecolor
        from have
        )
     group by id
     having count(*)>1
     ) as a
    ,have as b 
  where a.id=b.id
 ;
quit;

I haven't tested that to see how it would do on big data.

 

 

I suppose if I only had numeric vars for the attributes, the SQL could be even easier, something like:

proc sql;
  create table want as
  select *
  from have
  group by id
  having range(sex) ne 0 or range(eyecolor) ne 0
 ;
quit;

I guess RANGE() would ignore missing values, so I would need to think that through a bit.
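One way to close that gap, sketched and untested: PROC SQL accepts the SAS summary functions NMISS and N, so a group that mixes missing and nonmissing values can be flagged alongside a nonzero range.

```sas
proc sql;
  create table want as
  select *
  from have
  group by id
  having range(sex) ne 0 or range(eyecolor) ne 0         /* differing nonmissing values   */
      or (nmiss(sex) > 0 and n(sex) > 0)                 /* SEX mixes missing/nonmissing  */
      or (nmiss(eyecolor) > 0 and n(eyecolor) > 0)       /* EYECOLOR mixes missing/nonmissing */
  ;
quit;
```

This still only works for numeric attributes, since RANGE, N, and NMISS are numeric summary functions.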

 

 

So all that said, I wanted to throw this out to the group to see if folks have approaches they like.  This check seems to come up for me fairly often, and my guess is that each time I approach it a different way, so I'm thinking of macro-izing an approach.

 

Thanks,

-Q.

BASUG is hosting free webinars Next up: Jane Eslinger presenting PROC REPORT and the ODS EXCEL destination on Mar 27 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

10 REPLIES
ballardw
Super User

I would try something like:

 

proc sql;
   create table Patients as
   select distinct patientid, sex, dob, eyecolor, <whatever variables should be the same>
   from have;
quit;

 

proc freq data=patients noprint;
    tables patientid / out=dups(where=(count>1));
run;

 

The output dataset DUPS would have a list of the patient ids where one or more characteristics change. The COUNT value would tell you how many distinct combinations are involved.
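If you then want the full offending records rather than just the ids, one could join back to the source (a sketch, using ballardw's PATIENTID naming rather than the ID variable in the sample data):

```sas
proc sql;
  create table conflict_records as
  select h.*
  from have as h
       inner join dups as d
         on h.patientid = d.patientid;
quit;
```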

NOTE: determining the correct values is an entire other problem especially if you have 2 records and 2 values for something.

 

And I do have recurring data where people change date of birth, race, gender and ethnicity. So I do understand this problem moderately well.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, I suppose my first question would be, why is it possible to have multiple values of constants in your data?  We have to handle data similar to this regularly as well.  The first rule of entered data (i.e. databases) is only collect the data in one place - i.e. gender on demography only.  If a data element appears in multiple places then it's a data management task to perform cross-checks on that data.  If the data is from a third party vendor, e.g. lab data, then as part of the data import process, data cleaning activities should be performed, e.g. cross-check demog, visit matching etc.  These cleaning activities should result in a clean database for analysis.  Generally speaking now, I only work with CDISC models, where this DM processing is already done.

Quentin
Super User

@RW9 agree that is the right question, and that's the question I always ask.

 

I've never been in a position to trust data I receive.  So when starting on a new project with new source data, I will often start by checking that the data make sense to me.  And if the data don't make sense (e.g. conflicts as above), then I will make a little sample of the problems and sit down with the owner/provider of the data to discuss how these problems occurred, how they could be prevented, how to resolve them if they can't be prevented, etc.

 

So yes, this is definitely a data cleaning activity.  And just about every one of my projects starts with such data cleaning, regardless of whether it is an official data management project or a statistical analysis.

 

Also even if I "know" data will be clean (because I'm tempted to trust the source database or whatever), I typically leave assertions in place in production code to catch any surprise errors that appear in the future, in case someone changes their database design and doesn't tell me.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

If that's the case, then yes, you're probably best off creating some sort of simple list of checks and running it through some macro code.

What about normalising all the data into a simple parameter/response dataset?  That is, some generic tool that takes a dataset and, for each column, creates a parameter/response row, so from your test data you would create:

ID    DOMAIN    PARAMETER    RESULT
1     DM        SEX          1
...

 

Once you have all your data in that kind of format it should simply be a matter of doing a proc sql; group by id, domain, parameter where count > 1.
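As a sketch of that normalisation (assuming ID and VISITDATE together uniquely identify a row in HAVE, and leaving out the DOMAIN column for simplicity):

```sas
proc sort data=have;
  by id visitdate;
run;

/* One row per id/visit/parameter: _NAME_ becomes PARAMETER, COL1 the RESULT */
proc transpose data=have out=long(rename=(_name_=parameter col1=result));
  by id visitdate;
  var sex eyecolor;
run;

/* Any id/parameter pair with more than one distinct result is a conflict */
proc sql;
  create table conflicts as
  select id, parameter, count(distinct result) as n_values
  from long
  group by id, parameter
  having count(distinct result) > 1;
quit;
```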

 

Ksharp
Super User
@Quentin,
You are on the right track, if you like to use SQL.

proc sql;
  create table ERROR as
  select *
  from have
  group by id
  having count(distinct catx(' ', id,sex,eyecolor)) gt 1
 ;
quit;


Quentin
Super User

Thanks @Ksharp, that's pretty nifty!  I will have to test this with my real data, to see if I run into problems with the max length of CATX, and with losing precision from numeric-to-character conversions.  But I like this idea.
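One possible guard against both concerns (a sketch, not tested): format each numeric explicitly with PUT and a wide format, and pick a delimiter that cannot appear in the values, so the concatenation is unambiguous.  Since the query already groups by ID, the ID variable can also be left out of the CATX call, which helps with the length limit.

```sas
proc sql;
  create table error as
  select *
  from have
  group by id
  /* BEST32. preserves full numeric precision; '|' keeps values unambiguous */
  having count(distinct catx('|', put(sex, best32.), put(eyecolor, best32.))) > 1
  ;
quit;
```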

Ksharp
Super User
Quentin,

If you have a big table, I wouldn't expect SQL to be very fast.
I would try PROC SORT with NODUPKEY, then NOUNIQUEKEY, plus a hash table:

proc sort data=have out=key nodupkey noequal sortsize=max;
 by id sex eyecolor;
run;
proc sort data=key out=HashTable nouniquekey noequal sortsize=max;
 by id;
run;

............Hash Table here.............
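One way that final hash step might look (a sketch: HASHTABLE, from the NOUNIQUEKEY sort, holds only the ids with more than one distinct id/sex/eyecolor combination):

```sas
data want;
  if _n_ = 1 then do;
    /* Load the conflicting ids; duplicate keys are ignored on load */
    declare hash h(dataset:'HashTable');
    h.defineKey('id');
    h.defineDone();
  end;
  set have;
  if h.check() = 0;   /* keep only records whose id was flagged as conflicting */
run;
```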

Quentin
Super User

Thanks @Ksharp I hadn't seen NOUNIQUEKEY before.

 

That approach makes me think of doing the same two checks as your sorts, but doing them in a data step.  Below I think will return all of the conflicts:

 

proc sort data=have;
  by id sex eyecolor;
run;

data v1/view=v1;
  set have;
  by id sex eyecolor;
  if first.eyecolor;
run;

data want;
  set v1;
  by id;
  if not (first.id and last.id);
  put (id sex eyecolor)(=);
run;
Ksharp
Super User
Quentin,
Yeah. Both are the same.

Why not use DOW?  Maybe it is faster than any of these.  Who knows.

data have;
  input id sex eyecolor visitdate score;
  cards;
1 1 1 1 10
1 1 1 2 20
1 1 1 3 30
2 1 2 1 10
2 1 2 2 20
2 2 2 3 30
;
run;


data ERROR;
  n=0;
  do until(last.id);
    set have;
    by id sex eyecolor;
    n+first.eyecolor;
  end;

  do until(last.id);
    set have;
    by id sex eyecolor;
    if n ne 1 then output;
  end;

  drop n;
run;

Quentin
Super User

Yes, that's similar to the DOW approach I posted in the original question.  I like it; my only concern was the need to sort the data by all three variables.  But I suppose some of the SQL solutions are likely sorting as well.  I guess to completely avoid the sort requirement, you could use a hash approach.  Below is the best I could come up with on a Saturday morning, but I'm sure there are better hash approaches.  It uses a double DOW loop and two hash tables.

 

data want;
  if 0 then set have;
  if _N_ = 1 then do;   
    declare hash hid(suminc: 'Count');  *Unique IDs;
    hid.defineKey('id');  
    hid.defineDone();

    declare hash hidConstant(); *Unique by ID and constant vars;
    hidConstant.defineKey('id','sex', 'eyecolor');
    hidConstant.defineDone();
  end;

  count=1;
  do until(eof);
    set have end=eof;
    if hid.add() ne 0 and hidConstant.add()=0 then do; *if (dup by ID) and (not dup by constant vars); 
      rc=hid.ref();  *Found a conflict, increment the counter;
    end;
  end;

  do until(eof2);
    set have end=eof2;
    hid.sum(sum: Total);
    if Total>1 then output want;
  end;

  drop rc;
run;

 

Usually my data would be sorted by ID, but not necessarily by SEX and EYECOLOR.  So if I'm willing to assume the data are pre-sorted by ID, the hash approach is simpler.  Still two passes of the data, but only one hash table, which can be cleared after each BY-group.

 

data want;
  if 0 then set have;
  if _N_ = 1 then do;   
    declare hash hidConstant(); *Unique by ID and constant vars;
    hidConstant.defineKey('id','sex', 'eyecolor');
    hidConstant.defineDone();
  end;

  do _n=1 by 1 until(last.id);
    set have;
    by id;

    if _n > 1 and hidConstant.add()=0 then do; *if dup by ID but not dup by constant vars;
      _Conflict=1;
    end;
  end;
  hidConstant.clear();

  do until(last.id);
    set have;
    by id;
    if _Conflict then output want;
  end;
run;

 

