Solved: Re: Selecting rows where variable has the lowest decimal places

DathanMD · Posted 01-06-2021 11:14 PM

Hi Guys,

I am new to SAS and I would love someone to crack for me this logic?

I have a dataset with patients and weight taken on several days. However, some patients have multiple readings on same day. I would like to keep one row per patient per day. I have concluded that entries with more decimal places are likely to be incorrect. Therefore, where one patient has multiple weights on same day, I would take the one with the lowest number of decimal places. However, where a patient has only one weight on a specific day, I would like to keep the the available weight (regardless of the decimal places). The data I have is table_1 and the data i want is table 2

table_1

patient	mental_score	date	weight	weight_type
19000067	34	15-02-10	65	1
19000067	28	15-02-10	66.1	2
19000068	28	13-03-08	70.12	1
19000068	34	13-03-08	70.123	1
19000068	31	13-03-08	70.02262443	2
19000099	31	23-10-09	81.36	1
19100069	34	08-12-10	71	2
19100069	23	08-12-10	71.2	2
19100069	28	08-12-10	86.26	1
1910070	45	07-06-99	70	1
1910070	45	07-06-99	71.9	3
1910071	45	29-01-98	79.97737557	1
1910076	45	07-06-99	79.97737557	2
1910076	45	07-06-99	70.0226	1
1910567	45	09-02-00	100	2
1910567	45	09-02-00	99.022	1
1910567	45	09-02-00	99.2	2

table_2

patient	mental_score	date	weight	weight_type
19000067	34	15-02-10	65	1
19000068	28	13-03-08	70.12	1
19000099	31	23-10-09	81.36	1
19100069	34	08-12-10	71	2
1910070	45	07-06-99	70	1
1910071	45	29-01-98	79.97737557	1
1910076	45	07-06-99	70.0226	1
1910567	45	09-02-00	100	2

Thank you, a lot,

Dathan Bonanebye

ballardw · Posted 01-07-2021 01:54 AM

Actual example input data in the form of a data step.

Or at least print a before and after result.

Subset the input to some that has the problem. Use a where statement to subset the data to some of the Ids.

Largish sorts may run quicker using the TAGSORT option but that would not affect the First. processing.

This data set:

data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;

does not need By processing and appears to likely have a spelling issue between lab_d and lab_v.

Best approach for sharing code, especially if there may be questions about behavior, is to copy the steps or procedures in question from LOG with all the notes and warnings. Then on the forum open a text box using the </> icon that appears at the top of the message window and paste. This will preserve formatting of SAS warning and error messages. The main message windows reformat pasted text if you hadn't noticed.

Does your log for the data step where you add the decimals or select the First.date records show any warning or notes other than "data x was created with xxx observations and yyy varibles"?

Some possible causes:

Trailing characters other than spaces that don't display with your font, i.e the ASCII null or 255, which would make them look sorted but the first/last uses the value differently.

data example;
   x="123456 ";
   /* the "space" below is an ASCII null*/
   /* entered by holding down the ALT key and
      typing 255 on the numeric keypad in 
      windows
   */
   y="123456 ";
   if x=y then put "same";
   else put "different";
run;

Or a FORMAT applied to a character variable that is shorter than the actual values. This can make different values look the same in output

Decimal portions in your dates. If the dates are numeric and you use a SAS date format to make them legible the decimal portion is ignored by the format so the value looks the same to a human but the First/Last uses the numeric value. This is more likely to happen if the data were ever in a spreadsheet, possibly with a datetime value.

data example;
   x=22456;
   y=22456.0001;
   put x=date9. y= date9.;
   if x=y then put "same";
   else put "different";
run;

Your dates aren't actually SAS dates but character values and perhaps you have some confusion from data entry of 0 and O characters, or one, 1, lower case L, l, and or upper case i , I .Or different types of dashes between the date components.

View solution in original post

ballardw · Posted 01-07-2021 12:21 AM

Here is an example of one way to get the number of decimal places.

data junk;
  x= 	70.02262443;
  decimals = length(scan(put(x,best32. -L),2,'.'));
run;

Working from the inside out on that equation: Put (x.best32 -L) creates a left justified (no leading spaces) character value that look like the numeric X. Scan(string,2,'.') says to treat the decimal point as a delimiter and the 2 is to get the bit after the decimal, i.e. 02262773. The Length function then counts the number of characters, i.e. digits, in that string of "decimal" values.

Sort the data by patient, date, decimals.

Then:

Proc sort data=have;/* this data set needs to have the added decimals variable*/
   by patient date decimals;
run;

data want;
   set have;
   by patient date;
   if first.date;
run;

When you use BY processing in a data step SAS creates automatic variables First. and Last. for each variable on the by statement. These automatic variables take on values of 1 (true) or 0(false) indicating whether the current record is the first or last of the group for the given variable. Note the dot between the keyword and variable name and no space allowed.

The If First.date ; keeps only records where the First.date is true. Since we sorted first the smaller number of decimals would be the first for each date within the patients records. If there is only one record it is first and kept.

Caution: if you have missing values for your weight variable the number of decimals will be 0. So if you have such in your data you will need to add logic about how to handle that situation.

If there is only a single record for by variable then both First and Last are true.

DathanMD · Posted 01-07-2021 01:15 AM

Hi Ballardw;

For some reason this doesnt work-I triend same approach earlier before (and I have retried). The output dataset still has multiple entries. I am not sure what the reason could be. My dataset is quite heavy (1 million rows with over 12 columns). Does the by statement become innefficient when sorting by more than 3 variables? I hacve tried also sorting in SAS enterprise guide (keeping first entry for every unique patient and date) and botha approaches still have some patients with multiple entries. Below is my code.

data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have;
by patient date decimal;
run;

data want2;
set want;
by patient date;
if first.date;
run;

Any other approah?

Thanks

Dathan

ballardw · Posted 01-07-2021 01:54 AM

Actual example input data in the form of a data step.

Or at least print a before and after result.

Subset the input to some that has the problem. Use a where statement to subset the data to some of the Ids.

Largish sorts may run quicker using the TAGSORT option but that would not affect the First. processing.

This data set:

data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;

does not need By processing and appears to likely have a spelling issue between lab_d and lab_v.

Best approach for sharing code, especially if there may be questions about behavior, is to copy the steps or procedures in question from LOG with all the notes and warnings. Then on the forum open a text box using the </> icon that appears at the top of the message window and paste. This will preserve formatting of SAS warning and error messages. The main message windows reformat pasted text if you hadn't noticed.

Does your log for the data step where you add the decimals or select the First.date records show any warning or notes other than "data x was created with xxx observations and yyy varibles"?

Some possible causes:

Trailing characters other than spaces that don't display with your font, i.e the ASCII null or 255, which would make them look sorted but the first/last uses the value differently.

data example;
   x="123456 ";
   /* the "space" below is an ASCII null*/
   /* entered by holding down the ALT key and
      typing 255 on the numeric keypad in 
      windows
   */
   y="123456 ";
   if x=y then put "same";
   else put "different";
run;

Or a FORMAT applied to a character variable that is shorter than the actual values. This can make different values look the same in output

Decimal portions in your dates. If the dates are numeric and you use a SAS date format to make them legible the decimal portion is ignored by the format so the value looks the same to a human but the First/Last uses the numeric value. This is more likely to happen if the data were ever in a spreadsheet, possibly with a datetime value.

data example;
   x=22456;
   y=22456.0001;
   put x=date9. y= date9.;
   if x=y then put "same";
   else put "different";
run;

Your dates aren't actually SAS dates but character values and perhaps you have some confusion from data entry of 0 and O characters, or one, 1, lower case L, l, and or upper case i , I .Or different types of dashes between the date components.

Kurt_Bremser · Posted 01-07-2021 07:06 AM

@DathanMD wrote:

Hi Ballardw;

For some reason this doesnt work-I triend same approach earlier before (and I have retried). The output dataset still has multiple entries. I am not sure what the reason could be. My dataset is quite heavy (1 million rows with over 12 columns). Does the by statement become innefficient when sorting by more than 3 variables? I hacve tried also sorting in SAS enterprise guide (keeping first entry for every unique patient and date) and botha approaches still have some patients with multiple entries. Below is my code.

data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have;
by patient date decimal;
run;

data want2;
set want;
by patient date;
if first.date;
run;

Any other approah?

Thanks

Dathan

This code won't work as intended:

data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have; /* you sort the original dataset, not the "want" dataset that contains the new variable */
by patient date decimal;
run;
data want2;
set want;
by patient date;
if first.date;
run;

This code was tested with your data as posted and worked:

data with_length;
set have;
l = lengthn(scan(put(weight,best32.),2,"."));
run;

proc sql;
create table want as
  select
    patient,
    mental_score,
    date,
    weight,
    weight_type
  from with_length
  group by patient, date
  having l = min(l)
;
quit;

This worked also, creating the same result:

proc sort data=with_length;
by patient date l;
run;

data want2;
set with_length;
by patient date;
if first.date;
run;

If this does not do it for you, post the complete log from your code by copy/pasting into a window opened with this button:

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Selecting rows where variable has the lowest decimal places

Re: Selecting rows where variable has the lowest decimal places

Re: Selecting rows where variable has the lowest decimal places

Re: Selecting rows where variable has the lowest decimal places

Re: Selecting rows where variable has the lowest decimal places

Re: Selecting rows where variable has the lowest decimal places

Ready to join fellow brilliant minds for the SAS Hackathon?