BookmarkSubscribeRSS Feed
lady8506
Quartz | Level 8

I've got a dataset in wide format that I'm trying to convert to long format. Over 50 variables have repeat measures that I need to transpose. 

 

In the code I used below, everything seemed to be going fine until I did the second DO loop for Symptoms. I noticed that a particular patient who had 2 Sides listed and 2 Symptoms listed was repeated in the output dataset 4 times. This is the format of the original datset:

 

                Side1        Side2     Symptom1     Symptom2      Symptom3

Patient A     2               1               1                    1                     .

 

This is what happened after the DO loops:

 

                 Side                Symptom

Patient A    2                          1

Patient A    2                          1

Patient A    1                          1

Patient A    1                          1

 

This is what I wanted:

 

                Side                Symptom

Patient A   2                           1

Patient A   1                           1

 

Here is my code:

 

DATA Felibertiloop;
    set Felibertiloop;
    array S1[2] Side1-Side2;
    **Output if all sides are missing;

    IF max(of S1(*))=. then
        output;
    ELSE
        DO I=1 to 2;
            **Only output when sides are not missing;
            Side=S1(i);

            IF not missing(Side) then
                output;
        end;
    drop i Side1 Side2;
RUN;

**Transpose Symptoms columns;

DATA Felibertiloop;
    set Felibertiloop;
    array Sy1[3] Symptoms1-Symptoms3;
    **Output if all symptoms are missing;

    IF max(of Sy1(*))=. then
        output;
    ELSE
        DO I=1 to 3;
            **Only output when sides are not missing;
            Symptom=Sy1(i);

            IF not missing(Symptom) then
                output;
        end;
    drop i Symptoms1 Symptoms2 Symptoms3;
RUN;

 

If anyone can help me figure out how to solve my problem, I'd appreciate it. Thanks!

8 REPLIES 8
Reeza
Super User

Thats the correct output for your code. You have two loops in two different datasteps with TWO output statements. So you get duplicates. 

 

Put it in one data step with ONE output statement. 

PS. I formatted your code and edited your post to make the code legible. It makes it much easier to read. 

 

I strongly advise you not to have the same dataset name in your DATA and SET statement. It makes it very hard to debug code and easy to make mistakes when you code in this manner. 

 

DATA Felibertiloop_Long;
    set Felibertiloop;
    array S1[2] Side1-Side2;
    array Sy1[2] Symptoms1-Symptoms2;
    **Output if all sides are missing;

    IF max(of S1(*))=. then
        output;
    ELSE
        DO I=1 to 2;
            **Only output when sides are not missing;
            Side=S1(i);
            symptom=sy1(i);

            IF not missing(Side) then
                output;
        end;
    drop i Side1 Side2;
RUN;

lady8506
Quartz | Level 8

Reeza, in your example:

 

IF max(of S1(*))=. then
        output;
IF not missing(Side) then
                output;

 Is written only for the Side variable, but in my original code, I'm showing that I need to do this for over 50 variables, not just "Side". Is there a way to write an IF statement such that it does the procedure for over 50 variables (and arrays), not just only Side or S1?

 

Example:

 

IF not missing(Side Symptom ClinSize RadSize TumorSize TumorStage....etc) then output;

 

IF max( of S1 Sy1 ...bunch of other arrays(*)) = . then output;

 

 

Reeza
Super User

Are there any assumptions you can make to simplify this, or can you have missing values anywhere? Or once you encounter a missing do you continue to have that variable in later stages?

 

The code is extendable, but I was working with what you posted, how were you originally planning to extend your process?

 

I think my answer is correct for your original question, this is an extension - which is fine. But, you should post sample data to reflect your actual situation, I can't see your computer, data or code and I don't want to make up data. 

 

If you do have many variables, I woudn't use a data step approach in the first place. 

Use PROC TRANSPOSE instead.

lady8506
Quartz | Level 8

The IF statements are in there because there are missing values. WIthout them, the DO loop duplicates patients unecessarily. 

 

Example:

 

                  Side1      Side2    

Patient B      1              .            

 

And without the IF statements in the DO loop, I get this:

 

                 Side          

Patient B     1

Patient B     .

 

The duplication of Patient B for a missing value is not needed. Also, if the values for a common variable are missing altogether, then the patient gets deleted, which I don't want. Hence, the other IF statement (IF not missing(Sides) then ouput....).

 

I am new to using both PROC TRANSPOSE and DO loops, and with PROC TRANSPOSE, I have the same problem as with the DO loop, a patient will get duplicated anyway for the missing value, which I don't want.

 

 

Reeza
Super User

@lady8506 wrote:

The IF statements are in there because there are missing values. WIthout them, the DO loop duplicates patients unecessarily. 

 

Example:

 

                  Side1      Side2    

Patient B      1              .            

 

And without the IF statements in the DO loop, I get this:

 

                 Side          

Patient B     1

Patient B     .

 

The duplication of Patient B for a missing value is not needed. Also, if the values for a common variable are missing altogether, then the patient gets deleted, which I don't want. Hence, the other IF statement (IF not missing(Sides) then ouput....).

 

I am new to using both PROC TRANSPOSE and DO loops, and with PROC TRANSPOSE, I have the same problem as with the DO loop, a patient will get duplicated anyway for the missing value, which I don't want.

 

 


But what about the rest of the values. If Side is missing are the other values missing? Post a better example of what you have, including all 'edge' cases and what you want. The answers provided work for what you've shown. You're now asking about something you haven't shown. Missing can be accounted for in many ways, but its not clear how you want to do that. 

ballardw
Super User

Here's what I hope isn't a silly question. The data you are showing us was summarized by somebody to get the 1,2,3 or 1,2 versions.

Find out who and aske them for the data that they started with before summarizing. You in effect are really trying to "unsummarize" data.

 

If you look you find on this forum a largish number of posts that indicate creating data such is your starting data is a poor idea for analysis. You are finding out one of the reasons why.

Astounding
PROC Star

You have found only one of the tricky aspects here.  Another would arise if you have different numbers of measurements, like:

 

                Side1        Side2     Symptom1     Symptom2      Symptom3

Patient A     2               1               10                   .                     .

 

It's legitimate to ask in that case what the result should be.

 

I would recommend a program that is a little more cumbersome, but clearly gets you the right answer.  For example:

 

data want;

   set Felibertiloop;

   array narrow {2} side symptom;

   side = side1;

   symptom = symptom1;

   output;

   side = side2;

   symptom = symptom2;

   if max(of narrow{*}) > . then output;

   side = .;

   symptom = symptom3;

   if max(of narrow{*}) > . then output;

   * keep / drop as appropriate;

run;

 

I truly believe it will be easier to write this program than it would be to thoroughly QC the results of a more complex program.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1159 views
  • 2 likes
  • 5 in conversation