BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
DathanMD
Obsidian | Level 7

Hi Guys,

I am new to SAS and I would love someone to crack for me this logic?

I have a dataset with patients and weight taken on several days. However, some patients have multiple readings on same day. I would like to keep one row per patient per day. I have concluded that entries with more decimal places are likely to be incorrect. Therefore, where one patient has multiple weights on same day, I would take the one with the lowest number of decimal places. However, where a patient has only one weight on a specific day, I would like to keep the the available weight (regardless of the decimal places). The data I have is table_1 and the data i want is table 2

 

table_1

 

patient

mental_score

date

weight

weight_type

19000067

34

15-02-10

65

1

19000067

28

15-02-10

66.1

2

19000068

28

13-03-08

70.12

1

19000068

34

13-03-08

70.123

1

19000068

31

13-03-08

70.02262443

2

19000099

31

23-10-09

81.36

1

19100069

34

08-12-10

71

2

19100069

23

08-12-10

71.2

2

19100069

28

08-12-10

86.26

1

1910070

45

07-06-99

70

1

1910070

45

07-06-99

71.9

3

1910071

45

29-01-98

79.97737557

1

1910076

45

07-06-99

79.97737557

2

1910076

45

07-06-99

70.0226

1

1910567

45

09-02-00

100

2

1910567

45

09-02-00

99.022

1

1910567

45

09-02-00

99.2

2

 

table_2

 

patient

mental_score

date

weight

weight_type

19000067

34

15-02-10

65

1

19000068

28

13-03-08

70.12

1

19000099

31

23-10-09

81.36

1

19100069

34

08-12-10

71

2

1910070

45

07-06-99

70

1

1910071

45

29-01-98

79.97737557

1

1910076

45

07-06-99

70.0226

1

1910567

45

09-02-00

100

2

 

Thank you, a lot,

 

Dathan Bonanebye

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Actual example input data in the form of a data step.

Or at least print a before and after result.

 

Subset the input to some that has the problem. Use a where statement to subset the data to some of the Ids.

Largish sorts may run quicker using the TAGSORT option but that would not affect the First. processing.

This data set:

data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;

does not need By processing and appears to likely have a spelling issue between lab_d and lab_v.

Best approach for sharing code, especially if there may be questions about behavior, is to copy the steps or procedures in question from LOG with all the notes and warnings. Then on the forum open a text box using the </> icon that appears at the top of the message window and paste. This will preserve formatting of SAS warning and error messages. The main message windows reformat pasted text if you hadn't noticed.

 

 

Does your log for the data step where you add the decimals or select the First.date records show any warning or notes other than "data x was created with xxx observations and yyy varibles"?

 

Some possible causes:

Trailing characters other than spaces that don't display with your font, i.e the ASCII null or 255, which would make them look sorted but the first/last uses the value differently.

data example;
   x="123456 ";
   /* the "space" below is an ASCII null*/
   /* entered by holding down the ALT key and
      typing 255 on the numeric keypad in 
      windows
   */
   y="123456 ";
   if x=y then put "same";
   else put "different";
run;

Or a FORMAT applied to a character variable that is shorter than the actual values. This can make different values look the same in output

Decimal portions in your dates. If the dates are numeric and you use a SAS date format to make them legible the decimal portion is ignored by the format so the value looks the same to a human but the First/Last uses the numeric value. This is more likely to happen if the data were ever in a spreadsheet, possibly with a datetime value.

data example;
   x=22456;
   y=22456.0001;
   put x=date9. y= date9.;
   if x=y then put "same";
   else put "different";
run;

 

 

Your dates aren't actually SAS dates but character values and perhaps you have some confusion from data entry of 0 and O characters, or one, 1, lower case L, l, and or upper case i , I .Or different types of dashes between the date components.

 

 

View solution in original post

4 REPLIES 4
ballardw
Super User

Here is an example of one way to get the number of decimal places.

data junk;
  x= 	70.02262443;
  decimals = length(scan(put(x,best32. -L),2,'.'));
run;

Working from the inside out on that equation: Put (x.best32 -L) creates a left justified (no leading spaces) character value that look like the numeric X. Scan(string,2,'.') says to treat the decimal point as a delimiter and the 2 is to get the bit after the decimal, i.e. 02262773. The Length function then counts the number of characters, i.e. digits, in that string of "decimal" values.

 

Sort the data by patient, date, decimals.

Then:

Proc sort data=have;/* this data set needs to have the added decimals variable*/
   by patient date decimals;
run;

data want;
   set have;
   by patient date;
   if first.date;
run;

When you use BY processing in a data step SAS creates automatic variables First. and Last. for each variable on the by statement. These automatic variables take on values of 1 (true) or 0(false) indicating whether the current record is the first or last of the group for the given variable. Note the dot between the keyword and variable name and no space allowed.

The If First.date ; keeps only records where the First.date is true. Since we sorted first the smaller number of decimals would be the first for each date within the patients records. If there is only one record it is first and kept.

 

Caution: if you have missing values for your weight variable the number of decimals will be 0. So if you have such in your data you will need to add logic about how to handle that situation.

 

If there is only a single record for by variable then both First and Last are true.

DathanMD
Obsidian | Level 7

Hi Ballardw;

 

For some reason this doesnt work-I triend same approach earlier before (and I have retried). The output dataset still has multiple entries. I am not sure what the reason could be.  My dataset is quite heavy (1 million rows with over 12 columns). Does the by statement become innefficient when sorting by more than 3 variables? I hacve tried also sorting in SAS enterprise guide (keeping first entry for every unique patient and date) and botha approaches still have some patients with multiple entries. Below is my code.


data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have;
by patient date decimal;
run;

data want2;
set want;
by patient date;
if first.date;
run;

 

Any other approah?

 

Thanks

 

Dathan

ballardw
Super User

Actual example input data in the form of a data step.

Or at least print a before and after result.

 

Subset the input to some that has the problem. Use a where statement to subset the data to some of the Ids.

Largish sorts may run quicker using the TAGSORT option but that would not affect the First. processing.

This data set:

data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;

does not need By processing and appears to likely have a spelling issue between lab_d and lab_v.

Best approach for sharing code, especially if there may be questions about behavior, is to copy the steps or procedures in question from LOG with all the notes and warnings. Then on the forum open a text box using the </> icon that appears at the top of the message window and paste. This will preserve formatting of SAS warning and error messages. The main message windows reformat pasted text if you hadn't noticed.

 

 

Does your log for the data step where you add the decimals or select the First.date records show any warning or notes other than "data x was created with xxx observations and yyy varibles"?

 

Some possible causes:

Trailing characters other than spaces that don't display with your font, i.e the ASCII null or 255, which would make them look sorted but the first/last uses the value differently.

data example;
   x="123456 ";
   /* the "space" below is an ASCII null*/
   /* entered by holding down the ALT key and
      typing 255 on the numeric keypad in 
      windows
   */
   y="123456 ";
   if x=y then put "same";
   else put "different";
run;

Or a FORMAT applied to a character variable that is shorter than the actual values. This can make different values look the same in output

Decimal portions in your dates. If the dates are numeric and you use a SAS date format to make them legible the decimal portion is ignored by the format so the value looks the same to a human but the First/Last uses the numeric value. This is more likely to happen if the data were ever in a spreadsheet, possibly with a datetime value.

data example;
   x=22456;
   y=22456.0001;
   put x=date9. y= date9.;
   if x=y then put "same";
   else put "different";
run;

 

 

Your dates aren't actually SAS dates but character values and perhaps you have some confusion from data entry of 0 and O characters, or one, 1, lower case L, l, and or upper case i , I .Or different types of dashes between the date components.

 

 

Kurt_Bremser
Super User

@DathanMD wrote:

Hi Ballardw;

 

For some reason this doesnt work-I triend same approach earlier before (and I have retried). The output dataset still has multiple entries. I am not sure what the reason could be.  My dataset is quite heavy (1 million rows with over 12 columns). Does the by statement become innefficient when sorting by more than 3 variables? I hacve tried also sorting in SAS enterprise guide (keeping first entry for every unique patient and date) and botha approaches still have some patients with multiple entries. Below is my code.


data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have;
by patient date decimal;
run;

data want2;
set want;
by patient date;
if first.date;
run;

 

Any other approah?

 

Thanks

 

Dathan


This code won't work as intended:

data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have; /* you sort the original dataset, not the "want" dataset that contains the new variable */
by patient date decimal;
run;
data want2;
set want;
by patient date;
if first.date;
run;

This code was tested with your data as posted and worked:

data with_length;
set have;
l = lengthn(scan(put(weight,best32.),2,"."));
run;

proc sql;
create table want as
  select
    patient,
    mental_score,
    date,
    weight,
    weight_type
  from with_length
  group by patient, date
  having l = min(l)
;
quit;

This worked also, creating the same result:

proc sort data=with_length;
by patient date l;
run;

data want2;
set with_length;
by patient date;
if first.date;
run;

 

If this does not do it for you, post the complete log from your code by copy/pasting into a window opened with this button:

Bildschirmfoto 2020-04-07 um 08.32.59.jpg

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 860 views
  • 1 like
  • 3 in conversation