- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hey there, spending way too much time trying to find a workaround to this issue but is time to ask for support.
I created a macro that transposed data from a wide format (variables names in columns, values in rows) to a long one (variable name in one column, values as columns per year and sequence). It works very well when values are unique, but failed for some datasets as there are multiple observations for same ID and YEAR_SEQUENCE. It does not really failed but got unwanted results. This is due to the LET function in PROC TRANSPOSE that allows the macro to continue (with warnings) but the desired output is not what I need as only the last row per ID is outputted.
The main goal is to flag when there was a change in data values between the sequence numbers (e.g. 1 if there was a change between sequence 1 and sequence 2, or sequence 2 and sequence 3).
This is a dummy example:
As you can see, for ID 2, Var_A has values STEP_A and STEP_B for same period (year and sequence), likewise VAR_B has two values for same period.
Any help will be pretty much appreciated.
data table;
input ID $ ACC_DATE SEQUENCE VAR_A $ VAR_B ;
datalines;
1 202201 1 STEP_B 350
1 202201 2 STEP_B 500
1 202201 3 STEP_B 700
2 202201 1 STEP_A 20
2 202201 1 STEP_B 5
2 202201 2 STEP_A 20
2 202201 2 STEP_B 5
2 202201 3 STEP_A 20
2 202201 3 STEP_B 5 ; run;
PROC SORT DATA= TABLE FORCE; BY ID ACC_DATE SEQUENCE ; RUN;
proc transpose data = work.Table out = work.Table_T1 name= VARIABLE PREFIX=_ DELIM=WD let ;
by ID; var var_a var_b;
id ACC_DATE SEQUENCE; run;
want
ID | Variable | _202201_1 | _202201_2 | _202201_3 |
1 | VAR_A | STEP_B | STEP_B | STEP_B |
1 | VAR_B | 350 | 500 | 700 |
2 | VAR_A | STEP_A | STEP_A | STEP_A |
2 | VAR_A | STEP_B | STEP_B | STEP_B |
2 | VAR_B | 20 | 40 | 80 |
2 | VAR_B | 5 | 10 | 10 |
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Should the values of VAR_B in the last 4 rows of the initial data table match those in your "want" table?
If so, then the only way I could think to do this (other than transposing with the DATA step) is to add an additional SEQ_CNT variable for proc transpose to use as a BY variable.
data table;
input ID $ ACC_DATE SEQUENCE VAR_A $ VAR_B ;
datalines;
1 202201 1 STEP_B 350
1 202201 2 STEP_B 500
1 202201 3 STEP_B 700
2 202201 1 STEP_A 20
2 202201 1 STEP_B 5
2 202201 2 STEP_A 40
2 202201 2 STEP_B 10
2 202201 3 STEP_A 80
2 202201 3 STEP_B 10
;
run;
PROC SORT DATA= TABLE FORCE; BY ID ACC_DATE SEQUENCE; RUN;
* create SEQ_CNT variable;
data table;
set table;
by ID ACC_DATE SEQUENCE;
if first.SEQUENCE then SEQ_CNT=0;
SEQ_CNT+1;
run;
PROC SORT DATA= TABLE FORCE; BY ID SEQ_CNT; RUN;
proc transpose data=work.Table out=work.Table_T1 (drop=SEQ_CNT) name=VARIABLE PREFIX=_ DELIM=_ let;
by ID SEQ_CNT;
var var_a var_b;
id ACC_DATE SEQUENCE;
run;
proc sort data=Table_T1;
by ID VARIABLE;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Should the values of VAR_B in the last 4 rows of the initial data table match those in your "want" table?
If so, then the only way I could think to do this (other than transposing with the DATA step) is to add an additional SEQ_CNT variable for proc transpose to use as a BY variable.
data table;
input ID $ ACC_DATE SEQUENCE VAR_A $ VAR_B ;
datalines;
1 202201 1 STEP_B 350
1 202201 2 STEP_B 500
1 202201 3 STEP_B 700
2 202201 1 STEP_A 20
2 202201 1 STEP_B 5
2 202201 2 STEP_A 40
2 202201 2 STEP_B 10
2 202201 3 STEP_A 80
2 202201 3 STEP_B 10
;
run;
PROC SORT DATA= TABLE FORCE; BY ID ACC_DATE SEQUENCE; RUN;
* create SEQ_CNT variable;
data table;
set table;
by ID ACC_DATE SEQUENCE;
if first.SEQUENCE then SEQ_CNT=0;
SEQ_CNT+1;
run;
PROC SORT DATA= TABLE FORCE; BY ID SEQ_CNT; RUN;
proc transpose data=work.Table out=work.Table_T1 (drop=SEQ_CNT) name=VARIABLE PREFIX=_ DELIM=_ let;
by ID SEQ_CNT;
var var_a var_b;
id ACC_DATE SEQUENCE;
run;
proc sort data=Table_T1;
by ID VARIABLE;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Pechan_Uvas You should be aware that when you provide data as data step with Datalines or Cards that if you place the semicolon that ends the data block on a line with data that the line with the semicolon is not in the resulting dataset.
So please make sure that your data steps look like:
data table;
input ID $ ACC_DATE SEQUENCE VAR_A $ VAR_B ;
datalines;
1 202201 1 STEP_B 350
1 202201 2 STEP_B 500
1 202201 3 STEP_B 700
2 202201 1 STEP_A 20
2 202201 1 STEP_B 5
2 202201 2 STEP_A 20
2 202201 2 STEP_B 5
2 202201 3 STEP_A 20
2 202201 3 STEP_B 5
;
Note that the Run is not needed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want to generate multiple observations per ID then you need some other variable that can distinguish the observations that go together.
data for_transpose ;
set table;
by id acc_date sequence ;
row+1;
if first.sequence then row=1;
run;
proc sort;
by id row;
run;
proc transpose data = for_transpose out = Table_T1 name= VARIABLE DELIM=_ prefix=X_;
by ID row;
id ACC_DATE SEQUENCE;
var var_a var_b;
run;
proc sort;
by id variable row;
run;
proc print data=table_t1;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This solution also works very well. Never seen the use of row+1 in a BY statement. Thanks for proposing a different approach.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Pechan_Uvas wrote:
This solution also works very well. Never seen the use of row+1 in a BY statement. Thanks for proposing a different approach.
ROW+1;
Is a SUM statement. It has nothing directly to do with the BY statement. It is the IF statement that is using the BY statement because it is referencing the FIRST. variable created by the BY statement.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Pechan_Uvas wrote:
The main goal is to flag when there was a change in data values between the sequence numbers (e.g. 1 if there was a change between sequence 1 and sequence 2, or sequence 2 and sequence 3).
data table; input ID $ ACC_DATE SEQUENCE VAR_A $ VAR_B ; datalines; 1 202201 1 STEP_B 350 1 202201 2 STEP_B 500 1 202201 3 STEP_B 700 2 202201 1 STEP_A 20 2 202201 1 STEP_B 5 2 202201 2 STEP_A 20 2 202201 2 STEP_B 5 2 202201 3 STEP_A 20 2 202201 3 STEP_B 5 ; run; PROC SORT DATA= TABLE FORCE; BY ID ACC_DATE SEQUENCE ; RUN; proc transpose data = work.Table out = work.Table_T1 name= VARIABLE PREFIX=_ DELIM=WD let ; by ID; var var_a var_b; id ACC_DATE SEQUENCE; run;
want
ID Variable _202201_1 _202201_2 _202201_3 1 VAR_A STEP_B STEP_B STEP_B 1 VAR_B 350 500 700 2 VAR_A STEP_A STEP_A STEP_A 2 VAR_A STEP_B STEP_B STEP_B 2 VAR_B 20 40 80 2 VAR_B 5 10 10
If your goal is to flag data values being different, transposing is not needed (and results in a horrible and difficult to use formatting of the table). You can use the LAG function in data step, without transposing, to obtain flags when data values are different.
Paige Miller