BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Andalusia
Obsidian | Level 7

note: Im using SAS EG version 7.15

I have a dataset which looks like below(please note that in reality its much bigger!). 

(char)  (num)     (char)
type     quarter   phase
K-11     202001    1
K-11     202101    2
K-11     202003          
K-12     202101    3
K-12     202102          
K-12     202002    1
K-13     202002    2
K-13     202103    3
K-13     202104    2

I want to achieve the following:
1. I want to place the values of quarters as horizontal columns and fill their values with the values of phase. So my desired output would be this:

 

type    202001 202002 202003 202004 202101 202102 202103 202104
K-11     1                          2       
K-12            1                   3 
K-13            2                                  3       2

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Sorry, but your dataset is simply dumb.

Storing a number which you need for calculation as character is stupidity #1. Numbers are numbers are numbers, and if you want to make a calculation from them, you STORE THEM THAT WAY, period.

Second, a date-related values HAS TO BE STORED AS SUCH in SAS, otherwise you deprive yourself completely of using all the nice tools SAS provides for dealing with such values. A SAS date value is numeric, contains a count of days starting at 1960-01-01, and has a date format attached.

 

For the intended result:

A wide layout is needed for two reasons: running a regression (or similar) analysis that needs lots of categorical values in one observation, or as a report for human consumption. For anything you want to use in further data processing, you use a long layout.

 

The phase differences are calculated like that:

data have;
infile datalines truncover;
input type $ quarter :yyq6. phase;
format quarter yyqn6.;
datalines;
K-11 2020Q01 1
K-11 2021Q01 2
K-11 2020Q03
K-12 2021Q01 3
K-12 2021Q02 1
K-12 2020Q02 1
K-13 2020Q02 2
K-13 2021Q03 3
K-13 2021Q04 2
;

proc sort data=have;
by type quarter;
run;

data want;
set have;
where phase ne .;
by type;
retain diff_phase;
if first.type then diff_phase = phase;
if last.type
then do;
  diff_phase = phase - diff_phase;
  output;
end;
diff_phase = phase;
keep type diff_phase;
run;

To create the wide dataset, you need this:

proc sql;
select min(quarter), max(quarter) into :minq, :maxq from have;
quit;

data template;
quarter = &minq.;
do until (quarter gt &maxq.);
  output;
  quarter = intnx('quarter',quarter,1);
end;
run;

data long;
set
  template
  have
;
run;

proc transpose
  data=long
  out=wide (
    drop=_name_
    where=(type ne "")
  )
  prefix=q_
;
by type;
var phase;
id quarter;
run;

data want_wide;
merge
  wide
  want
;
by type;
run;

The template is needed to create the order of columns for the tranpose.

All this is done easier in PROC REPORT:

data pre_report / view=pre_report;
merge
  have
  want
;
by type;
run;

proc report data=pre_report;
column type phase,quarter diff_phase;
define type / group;
define phase / "" analysis;
define quarter / "" across order=formatted;
define diff_phase / group;
run;

View solution in original post

21 REPLIES 21
Kurt_Bremser
Super User

PROC REPORT does this:

data have;
input type $ quarter :yymmn6. phase;
format quarter yymmn6.;
datalines;
K-11     202001    1
K-11     202101    2
K-11     202003           
K-12     202101    3
K-12     202102           
K-12     202002    1
K-13     202002    2
K-13     202103    3
K-13     202104    2
;

proc report data=have;
column type phase,quarter;
define type / group;
define phase / "" analysis;
define quarter / "" across order=formatted;
run;
Andalusia
Obsidian | Level 7

@Kurt_Bremser 

 

I tried your snippet on my data set and I got this:

ERROR: phase is an ANALYSIS variable but not numeric.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE REPORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

 

Kurt_Bremser
Super User

Then we need to add a little twist to the REPORT procedure:

data have;
infile datalines truncover;
input type $ quarter :yymmn6. phase $;
format quarter yymmn6.;
datalines;
K-11     202001    1
K-11     202101    2
K-11     202003           
K-12     202101    3
K-12     202102           
K-12     202002    1
K-13     202002    2
K-13     202103    3
K-13     202104    2
;

proc report data=have;
column type phase,quarter n;
define type / group;
define phase / "" display;
define quarter / "" across order=formatted;
define n / noprint;
run;

By posting your data as a data step yourself, you prevent us from making mistakes with regards to the recreation of your data.

Andalusia
Obsidian | Level 7
@Kurt_Bremser. Thank you. It seems like it worked, however its a report and I cant do anything with it after it. I want it to be a dataset so I can query with it.

With regarding to your note on preparing data step myself. I tried to do that I even followed the blog of Jedi SAS Tricks but I could not figure out how it works.
Kurt_Bremser
Super User

A structure like you showed is typical for reports, for datasets it is very unfavorable, as it makes further coding harder.

Maxim 19: Long Beats Wide.

So what do you want to do with your data further on?

 

Look at my two DATALINES examples, they contain character, numeric and date values. Most of what you need to know to prepare simple datasets yourself.

Andalusia
Obsidian | Level 7

@Kurt_Bremser I ran your transpose snippet and received:

 

ERROR: The ID value "K-11" occurs twice in the same BY group.
ERROR: Too many bad BY groups.

 

What I want further with my data: I want to calculate the difference between the last quarter and the second last quarter and place that in a new column. After that I wan that dataset to be a LASR table to create some graphs with it.

With regards to your two DATALINES. I will use that as a reference for future questions. Thank you.

Kurt_Bremser
Super User

@Andalusia wrote:

@Kurt_Bremser I ran your transpose snippet and received:

 

ERROR: The ID value "K-11" occurs twice in the same BY group.
ERROR: Too many bad BY groups.

 

What I want further with my data: I want to calculate the difference between the last quarter and the second last quarter and place that in a new column. After that I wan that dataset to be a LASR table to create some graphs with it.

With regards to your two DATALINES. I will use that as a reference for future questions. Thank you.


Cannot happen with the data you posted:

data have;
infile datalines truncover;
input type $ quarter :yymmn6. phase $;
format quarter yymmn6.;
datalines;
K-11     202001    1
K-11     202101    2
K-11     202003           
K-12     202101    3
K-12     202102           
K-12     202002    1
K-13     202002    2
K-13     202103    3
K-13     202104    2
;

proc transpose
  data=have
  out=want (drop=_name_)
  prefix=q_
;
by type;
var phase;
id quarter;
run;

Log:

 73         data have;
 74         infile datalines truncover;
 75         input type $ quarter :yymmn6. phase $;
 76         format quarter yymmn6.;
 77         datalines;
 
 NOTE: The data set WORK.HAVE has 9 observations and 3 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.00 seconds
       
 
 87         ;
 88         
 89         proc transpose
 90           data=have
 91           out=want (drop=_name_)
 92           prefix=q_
 93         ;
 94         by type;
 95         var phase;
 96         id quarter;
 97         run;
 
 NOTE: There were 9 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT has 3 observations and 8 variables.
 NOTE:  Verwendet wurde: PROZEDUR TRANSPOSE - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.02 seconds


And you POSITIVELY do NOT need to transpose. You can easily compare values across observations by using a RETAINed variable, or by joining the dataset with itself. This is all much easier done with a vertical dataset.

What "difference" do you want to calculate? phase is character in your dataset, so it's not suited for calculations.

 

Please post a comprehensive example of your dataset (containing "edge cases") in usable form, and the expected outcome.

Andalusia
Obsidian | Level 7

@Kurt_Bremser Oke I tried to give you a comprehensive example of my dataset (containing "edge cases"). 

 

data have;
infile datalines truncover;
input type $ quarter phase $;
datalines;
K-11 202001 1
K-11 202101 2
K-11 202003
K-12 202101 3
K-12 202102 1
K-12 202002 1
K-13 202002 2
K-13 202103 3
K-13 202104 2
;

My quarter column is just a numeric, that needs to stay like that. Oké below is what I want as a dataset (not a report!):

type    202001 202002 202003 202004 202101 202102 202103 202104  diff_phase 
K-11     1                          2                             1        
K-12            1                   3      1                      -2 
K-13            2                                  3      2       -1 

@Kurt_Bremser I want all the columns above so I guess a transpose is necessary?. In diff_phase I calculate the difference between the last quarter and the second last quarter for every type.(So 2-1 for K-11, 1-3 for K-12 and 2-3 for K-13)

Really glad if you could help me solve this problem 🙂

Kurt_Bremser
Super User

Sorry, but your dataset is simply dumb.

Storing a number which you need for calculation as character is stupidity #1. Numbers are numbers are numbers, and if you want to make a calculation from them, you STORE THEM THAT WAY, period.

Second, a date-related values HAS TO BE STORED AS SUCH in SAS, otherwise you deprive yourself completely of using all the nice tools SAS provides for dealing with such values. A SAS date value is numeric, contains a count of days starting at 1960-01-01, and has a date format attached.

 

For the intended result:

A wide layout is needed for two reasons: running a regression (or similar) analysis that needs lots of categorical values in one observation, or as a report for human consumption. For anything you want to use in further data processing, you use a long layout.

 

The phase differences are calculated like that:

data have;
infile datalines truncover;
input type $ quarter :yyq6. phase;
format quarter yyqn6.;
datalines;
K-11 2020Q01 1
K-11 2021Q01 2
K-11 2020Q03
K-12 2021Q01 3
K-12 2021Q02 1
K-12 2020Q02 1
K-13 2020Q02 2
K-13 2021Q03 3
K-13 2021Q04 2
;

proc sort data=have;
by type quarter;
run;

data want;
set have;
where phase ne .;
by type;
retain diff_phase;
if first.type then diff_phase = phase;
if last.type
then do;
  diff_phase = phase - diff_phase;
  output;
end;
diff_phase = phase;
keep type diff_phase;
run;

To create the wide dataset, you need this:

proc sql;
select min(quarter), max(quarter) into :minq, :maxq from have;
quit;

data template;
quarter = &minq.;
do until (quarter gt &maxq.);
  output;
  quarter = intnx('quarter',quarter,1);
end;
run;

data long;
set
  template
  have
;
run;

proc transpose
  data=long
  out=wide (
    drop=_name_
    where=(type ne "")
  )
  prefix=q_
;
by type;
var phase;
id quarter;
run;

data want_wide;
merge
  wide
  want
;
by type;
run;

The template is needed to create the order of columns for the tranpose.

All this is done easier in PROC REPORT:

data pre_report / view=pre_report;
merge
  have
  want
;
by type;
run;

proc report data=pre_report;
column type phase,quarter diff_phase;
define type / group;
define phase / "" analysis;
define quarter / "" across order=formatted;
define diff_phase / group;
run;
Andalusia
Obsidian | Level 7

@Kurt_Bremser  This is a quiet complex situation, I dont know what is going wrong in my code. Can you see it? Also I can see that you changed the datelines and added the Q in it. I cannot do that and I also don't want to change my source data. 

This is my code:

proc sort data=work.filtered_main_dataset;
by type quarter;
run;

data want;
set work.filtered_main_dataset;
where c_phase ne .;
by type;
retain diff_phase;
if first.type then diff_phase = c_phase;
if last.type
then do;
  diff_phase = c_phase - diff_phase;
  output;
end;
diff_phase = c_phase;
keep type diff_phase;
run;

data long;
set
  work.filtered_main_dataset
;
run;

proc transpose
  data=long
  out=wide (
    drop=_name_
    where=(type ne "")
  )
  prefix=q_
;
by type;
var c_phase;
id quarter;
run;

data want_wide;
merge
  wide
  want
;
by type;
run;

 

 

This is my log:

2 The SAS System 23:15 Thursday, April 15, 2021
44
45 data long;
46 set
47 work.filtered_main_dataset
48 ;
49 run;

NOTE: There were 7238 observations read from the data set WORK.FILTERED_MAIN_DATASET.
NOTE: The data set WORK.LONG has 7238 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

50
51 proc transpose
52 data=long
53 out=wide (
54 drop=_name_
55 where=(type ne "")
56 )
57 prefix=q_
58 ;
59 by type;
60 var c_phase;
61 id quarter;
62 run;

ERROR: The ID value "q_201904" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B013
ERROR: The ID value "q_201902" occurs twice in the same BY group.
ERROR: The ID value "q_201904" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B014
ERROR: The ID value "q_201904" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B017
ERROR: The ID value "q_201904" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B020
ERROR: The ID value "q_201904" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B022
ERROR: The ID value "q_202004" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B025
ERROR: The ID value "q_202004" occurs twice in the same BY group.
ERROR: The ID value "q_202004" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B028
ERROR: The ID value "q_201904" occurs twice in the same BY group.
ERROR: The ID value "q_202004" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B030
ERROR: The ID value "q_201904" occurs twice in the same BY group.
3 The SAS System 23:15 Thursday, April 15, 2021

NOTE: The above message was for the following BY group:
type=TWK-B035
ERROR: The ID value "q_202001" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
type=TWK-B037
ERROR: The ID value "q_201903" occurs twice in the same BY group.
ERROR: Too many bad BY groups.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 269 observations read from the data set WORK.LONG.
WARNING: The data set WORK.WIDE may be incomplete. When this step was stopped there were 0 observations and 0 variables.
WARNING: Data set WORK.WIDE was not replaced because this step was stopped.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

63


64 data want_wide;
65 merge
66 wide
67 want
68 ;
69 by type;
70 run;

WARNING: Multiple lengths were specified for the BY variable type by input data sets. This might cause unexpected results.
NOTE: There were 3 observations read from the data set WORK.WIDE.
NOTE: There were 988 observations read from the data set WORK.WANT.
NOTE: The data set WORK.WANT_WIDE has 991 observations and 12 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

71
72
73
74
75 GOPTIONS NOACCESSIBLE;
76 %LET _CLIENTTASKLABEL=;
77 %LET _CLIENTPROCESSFLOWNAME=;
78 %LET _CLIENTPROJECTPATH=;
79 %LET _CLIENTPROJECTPATHHOST=;
80 %LET _CLIENTPROJECTNAME=;
81 %LET _SASPROGRAMFILE=;
82 %LET _SASPROGRAMFILEHOST=;
83
84 ;*';*";*/;quit;run;
85 ODS _ALL_ CLOSE;
86
87
88 QUIT; RUN;


This is how the first 28 rows of my filtered_main_dataset looks like:

Andalusia_0-1618526451509.png

 

 

Kurt_Bremser
Super User

Let's begin at the start: how do you receive your data? It may be that the unfavorable data types are the consequence of an ill-designed import process.

 

Next, there is no reason to continue working with dumb data if you receive it as such. While you process the data in SAS, you always strive for the most easy to handle form, which means treating dates/times as such and so on. If you need to re-export processed data to the source, you can always convert to the original data types then, but you do not make your own life harder just because someone else is a masochist. Or an idiot.

 

Next, implement Maxim 3 (Know Your Data). You have values for type that have more than one entry for a given quarter. Before being able to transpose/report on such data, you need to decide how to handle these multiple entries (select min, max, sum or average).

Andalusia
Obsidian | Level 7

Im getting this data from an external data source. Oke to solve my problem I think I need to change my quarter data to your quarter data(with that Q). How can that be done easily then?

With regards to your Maxim 3. I want to take jus the last quarter if that quarter occurs more then once in a type. How can I achieve that?

Kurt_Bremser
Super User

In which firm does the data arrive? Direct unload from a DBMS, text file, Excel?

This will determine where you make the conversion to proper variable types and values.

Show the code you use to get the data into SAS.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Creating Custom Steps in SAS Studio

Check out this tutorial series to learn how to build your own steps in SAS Studio.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 21 replies
  • 4000 views
  • 2 likes
  • 3 in conversation