BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Manol_Jovani
Fluorite | Level 6

Hello

 

I am using SAS, version 9.4

 

I have the following dataset:

 

data have;

input id IDTYPE CANC TOPO TIME NOMEL INCIDENCE;

datalines;

001 1 1 173 2 1 0

002 1 1 173 3 1 1

003 1 1 173 4 1 0

003 1 1 145 6 0 1

004 3 1 173 1 1 0

004 3 1 184 2 0 1

004 3 1 185 6 0 1

004 3 1 173 9 1 1

005 1 1 122 3 0 1

006 0 . .   . . .

;

run;

 

 

As you can observe, patient 1 (ID=001) and 2 (ID=002) have only one observation. Same for patients 5 and 6, whereas patients 3 and 4 have more than one observation.

 

I would like to eliminate the row with nomel=1 in patients 3 and 4, but not eliminate patients 1 and 2 even though they have also nomel=1.

 

Thus, for patient 3 I would like to eliminate the first row, whereas for patient 4 I would like to eliminate a first and fourth row.

 

I was thinking that the best way to do that would be to create a variable that is the same for all rows with the same ID.

For example, if I create the variable “nomel_del” that would be “1” if nomel=1 and there is more than one observation on the same patient, or “0” otherwise (if there is nomel=1 but there are no other observations on the same patient, as in patients 1 and 2, or if nomel =0 or nomel=.). Then I could easily eliminate all rows with nomel=1 and nomel_del=1.

 

The end result should be following:

 

Obs

id

IDTYPE

CANC

TOPO

TIME

NOMEL

INCIDENCE

1

1

1

1

173

2

1

0

2

2

1

1

173

3

1

1

3

3

1

1

145

6

0

1

4

4

3

1

184

2

0

1

5

4

3

1

185

6

0

1

6

5

1

1

122

3

0

1

7

6

0

.

.

.

.

.

 

 

I am trying to figure out how to do this, but I am not being able to. Would be very grateful if someone can help me.

 

Thank you very much

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Well asked question. Translates directly to SQL:

 

data have;
input id IDTYPE CANC TOPO TIME NOMEL INCIDENCE;
datalines;
001 1 1 173 2 1 0
002 1 1 173 3 1 1
003 1 1 173 4 1 0
003 1 1 145 6 0 1
004 3 1 173 1 1 0
004 3 1 184 2 0 1
004 3 1 185 6 0 1
004 3 1 173 9 1 1
005 1 1 122 3 0 1
006 0 . .   . . .
;

proc sql;
create table many as
select id from have group by id having count(*) > 1;
delete from have
where id in (select id from many) and NOMEL=1;
select * from have;
quit;
PG

View solution in original post

5 REPLIES 5
PGStats
Opal | Level 21

Well asked question. Translates directly to SQL:

 

data have;
input id IDTYPE CANC TOPO TIME NOMEL INCIDENCE;
datalines;
001 1 1 173 2 1 0
002 1 1 173 3 1 1
003 1 1 173 4 1 0
003 1 1 145 6 0 1
004 3 1 173 1 1 0
004 3 1 184 2 0 1
004 3 1 185 6 0 1
004 3 1 173 9 1 1
005 1 1 122 3 0 1
006 0 . .   . . .
;

proc sql;
create table many as
select id from have group by id having count(*) > 1;
delete from have
where id in (select id from many) and NOMEL=1;
select * from have;
quit;
PG
Manol_Jovani
Fluorite | Level 6

Hello

 

Thank you very much for your suggestion.

 

I would like to kindly ask another related question. 

 

 

The dataset I sent above was simply a "training" dataset, just to make sure that I was doing all the necessary and correct exclusions. So, when i wrote that it worked in my comment, I meant that it worked perfectly well in the "training" dataset.

 

The "training" dataset mimics the real dataset, in which each participant (determined by "newid") can have different lines for each cancer that they have (differerent observations; if they have more than one cancer).

 

However, when I tried to apply the code to the real dataset, SAS gets "blocked" (every time it goes in the state of "SAS non responding")

 

Let me explain:

 

The real dataset contains about 5000 observations. 

 

 

I do a first code of exclusion to eliminate all non-melanoma prevalent cancers, with the following code:

 

proc sql;

    create table all4 as

  select * from all3 where newid not in (select distinct newid  from all3  where incidence eq 0 and nomel eq 0) ;

  quit;

 

This code runs perfectly well in less than one minute. If there is an individual (defined by "newid") that has a prevalent cancer, then this codes eliminates all lines that are associated with that "newid" (it eliminates all the observations from that person). In this case "nomel" is non-melanoma skin cancer.

 

Now I try to apply the code that was suggested here to the new dataset created in this case, namely "all4".

 

I try to eliminate here all the "nomel" only in those that have multiple observations, but keep it in those that do not have multiple observations. 

 

I run the following:

 

proc sql;

create table all5 as

select newid from all4 group by newid having count(*) > 1;

delete from all4

where newid in (select newid from all5) and NOMEL=1;

select * from all4;

quit;

 

however, when I run this, SAS gets "blocked". It just does not run ("SAS non responding"). I tried this different times in many different days, but the same problem persists. 

 

At first I assumed that it was a complex operation that just needed time, but I left it running for an hour, and nothing changed. 

 

 

In the "training" dataset that I brought it, it worked perfectly fine and quickly.

 

I do not know why when i try to run it in the real dataset, it creates this problem, since it is not very different from the other "proc sql" code above. 

 

Can you please help me?

 

Thank you very much

 

 

 

PGStats
Opal | Level 21

I can't tell. Try to run each query separately to pinpoint where SAS hangs

 

proc sql;

create table all5 as

select newid from all4 group by newid having count(*) > 1;

quit;

 

Is all5 created? If it is, check its contents and if OK continue with:

 

proc sql;

delete from all4

where newid in (select newid from all5) and NOMEL=1;

quit;

 

Make sure you are not browsing the datasets when you run the queries.

 

 

PG
Manol_Jovani
Fluorite | Level 6

Hello

 

Thank you so much for the answers. I truly appreciate a lot. 

 

For many reasons, I have not been able to work on this project until now. 

 

When a few weeks ago I was running the code on the "training" dataset, I wrote that it worked. 

 

In a way it did. However, even back then I had noticed a problem there that I bypassed, since in that "training" dataset it was still giving the desired results.

 

The problem was that "all5" was not being created, but instead "all4" was being copied, and from it some of the observations were eliminated, as desired.

 

Apparently while in the "training" dataset this did not in the end create real problem since the desired results were achieved, in the real dataset this problem blocks the program from running.

 

In order to explain in detail this point I am including an attachment, with all the steps described in detail, including the results. Please see attached document (seems long, but it is really not, it is so just because I included also the results as Tables).

 

Thank you so very much

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 507 views
  • 1 like
  • 2 in conversation