turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Datastepping - beyond my skills..

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-25-2012 10:09 AM

Hi everyone!

I have a issue that i hope you can help me to resolve. I am trying to map out a newspapers

I have a large dataset. Over 70000 rows with approx 57 variables. The test dataset attached is a sample of 10000 randomly created, but with realistic outcomes.

This is the output that I get, and then work with in Excel. My assignment is to map a flow between these different categories. An example could be ID 4835:

Between weeks 1 and 2, hence change 2, he opened to a four week trial and therefore has the “OPENED to TRIAL_4W”. This trial runs for 4 weeks, and from week 5 to week 6, he closes and gets a “CLOSED from TRIAL_4W” mark. Seven weeks later, he opens to a 13 week trial period, and so on…

Now, what I would like SAS to do, is what I do manually in excel with pivottabels and its driving me up the wall..:

Let’s say I want to check how many of the four week trials that opened in January (change2, change3, change4), are still active 8 weeks later and in what category they are active. Then I would filter change2 to “OPENED to TRIAL_4W” and copy change8 (change2 + 8 weeks = change10), and then do a pivot table to count how they are distributed. I would repeat this procedure for change3 to change11, change4 to change12, then add up the 3 weeks (3 weeks because I don’t have 2011 data), and THEN I would have January.

As you might guess, this is a very tedious thing to do, and my errors keep on growing when doing manual excel slavework like this, simply because I have no idea how to do it in SAS.

In a perfect world, I would like a piece of SAS-code that:

- Looks at month Y tells me how many opened to different subscriptions.
- Conditioned on opening in month Y, how are they distributed over different subscriptions X weeks later?
- How many weeks they had a “blank” before opening to a new subscription?

If this is possible, I think that I would be able to build something that answered the rest of the questions I have. The only problem is, I have no idea where to start.

The code that generates the output is this:

**data** c;

set b;

retain count;

by ID;

if first.ID then count = **0**;

count = count + **1**;

if numabb = " " then advarsel = **1**;

**run**;**quit**;

**data** rotate;

set c;

by ID;

retain numabb1-numabb16;

array Anumabb(**1**:**16**) numabb1-numabb16;

if first.ID then do;

do i = **1** to **16**;

Anumabb* = ***.**;

end;

end;

Anumabb(week) = numabb;

if last.ID then output;

drop i;

**run**;**quit**;

**data** rotate_1;

set rotate;

array Anumabb(**16**) numabb1-numabb16;

do i = **1** to **16**;

if Anumabb* = *' ' then Anumabb* = ***0**;

end;

drop i;

**run**;**quit**;

**data** pattern;

set rotate_1;

array Anumabb(**16**) numabb1-numabb16;

array AChange(**2**:**16**) $ **45** change2-change16;

do w = **2** to **16**;

The sampledata is attached. I sincerely hope you can help!

- Toby

Accepted Solutions

Solution

04-29-2012
09:38 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-29-2012 09:38 AM

T,

I think I have the right idea here, and that it will be easy enough for you to adjust if not. This approach saves all the formatting for later on, but finds the initial promotional status, and the initial subsequent subscription status. It works with your final data set that already has transposed the data, but just as easily could be redone to work with the earlier format to the data with multiple records per ID.

You may need to adjust some of the ranges I used, like 16 to 20 = trial subscription. But you're most familiar with what the codes mean and which should be grouped.

data want;

set have;

array anumabb {16} numabb1 - numabb16;

do week=1 to 16 until (week_started_trial > .);

if (16 <= anumabb{week} <= 20) then do;

week_started_trial = week;

starting_trial_status = anumabb{week};

end;

end;

* May want to set a default, if there was never a trial;

if (0 < week_started_trial < 16) then

do week=week_started_trial+1 to 16 until (week_started_subscription > .);

if (1 <= anumabb{week} <= 15) then do;

week_started_subscription = week;

starting_subscription_status = anumabb{week};

end;

end;

keep id week_started_trial week_started_subscription starting_trial_status starting_subscription_status;

run;

There are lots of ways to format a table from this point. First, let's see if this is pulling out the information you need (or at least most of it).

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-25-2012 11:22 AM

Toby,

Here is some high-level guidance. I'm hoping you can work out the specifics.

The data set PATTERN contains all the necessary elements. Even the arrays that might facilitate programming had been defined. (Note that array definitions are temporary, so the ARRAY statements will need to appear in any new programs you write.)

Your mission will be to decide on the outcomes you would like to measure, and set up a new variable for each outcome. (New variables representing outcomes will probably take on values of 0/1, but that is not a requirement if something else works better.) Use a DATA step to add new variables to each row, measuring the outcomes for just that row. When you're done, use another tool (such as PROC FREQ) on the resulting data set to count how often each outcome occurs.

As a general rule, if you could look at the data on paper and figure out what to do, there will be a way to tell SAS how to do it. I hope this gives you enough guidance to work out the SAS-based approach.

Good luck.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-25-2012 11:41 AM

Toby, I, and I assume many others like me, will not open an .XLS file found on a forum (even a reputable one such as this one) for well known security reasons. I would feel safer opening a text-only file such as .csv.

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-26-2012 03:16 AM

Here is a piece of code , based on my understanding .

proc import datafile='c:\sampledata.xls' out=have dbms=excel replace;

getnames=yes;run;

data January ;

set have;

where (change2 ="OPENED to TRIAL_4W" and change10 is not missing ) or

(change3 ="OPENED to TRIAL_4W" and change11 is not missing ) or

(change4 ="OPENED to TRIAL_4W" and change12 is not missing ) ;

run;

Ksharp

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-26-2012 04:06 AM

Hi, and thanks for the anwsers.

@ Astounding - if i were to use a dummy method, the number of combinations over time would overwhelm me, and i would end up with ALOT of dummys for each time period. Down below is the code i use to categorize the different subscribtion directions.

@ PGStats, noted. Thanks. I've attached the sampledata as a .csv file! The SAS forums zips it for some reason, I hope it's OK?

@ Ksharp - Thanks for your input. That would probably work, but that would only give me a partial anwser to question two, and only for one month. I know that i can just take february data and do it the same way, but if you open the sampledata, you'll realize that the number of different combinations are quite large.

Here's the code i use for categorizing the diffrent subscribtions. This covers ALL of the possible movement, and as you can see, there are quite a few possible combinations.

data pattern;

set rotate_1;

array Anumabb(16) numabb1-numabb16;

array AChange(2:16) $ 45 change2-change16;

do w = 2 to 16;

*if Anumabb[w-1] > 0 and Anumabb[w-1] = Anumabb

******************************************* BEVÆGELSER ****************************************;

* NORMAL;

if Anumabb[w-1] = 1 and Anumabb

if Anumabb[w-1] = 2 and Anumabb

if Anumabb[w-1] = 3 and Anumabb

if Anumabb[w-1] = 4 and Anumabb

if Anumabb[w-1] = 5 and Anumabb

if Anumabb[w-1] = 6 and Anumabb

if Anumabb[w-1] = 7 and Anumabb

if Anumabb[w-1] = 8 and Anumabb

* KAMPAGNE;

if Anumabb[w-1] = 9 and Anumabb

if Anumabb[w-1] = 10 and Anumabb

if Anumabb[w-1] = 11 and Anumabb

if Anumabb[w-1] = 12 and Anumabb

if Anumabb[w-1] = 13 and Anumabb

if Anumabb[w-1] = 14 and Anumabb

* TRIAL;

if Anumabb[w-1] = 15 and Anumabb

if Anumabb[w-1] = 16 and Anumabb

if Anumabb[w-1] = 17 and Anumabb

if Anumabb[w-1] = 18 and Anumabb

* FREE;

if Anumabb[w-1] = 19 and Anumabb

if Anumabb[w-1] = 20 and Anumabb

* MIDLERTIDIGE LUKNINGER;

if Anumabb[w-1] = 21 and Anumabb

********************************* ÅBENEDE *************************************;

* NORMAL;

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

* KAMPAGNE;

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

* TRIAL;

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

* FREE;

if Anumabb[w-1] = 0 and Anumabb

if Anumabb[w-1] = 0 and Anumabb

* MIDLERTIDIGE LUKNINGER;

if Anumabb[w-1] = 0 and Anumabb

********************************* LUKKEDE *************************************;

* NORMAL;

if Anumabb[w-1] = 1 and Anumabb

if Anumabb[w-1] = 2 and Anumabb

if Anumabb[w-1] = 3 and Anumabb

if Anumabb[w-1] = 4 and Anumabb

if Anumabb[w-1] = 5 and Anumabb

if Anumabb[w-1] = 6 and Anumabb

if Anumabb[w-1] = 7 and Anumabb

if Anumabb[w-1] = 8 and Anumabb

* KAMPAGNE;

if Anumabb[w-1] = 9 and Anumabb

if Anumabb[w-1] = 10 and Anumabb

if Anumabb[w-1] = 11 and Anumabb

if Anumabb[w-1] = 12 and Anumabb

if Anumabb[w-1] = 13 and Anumabb

if Anumabb[w-1] = 14 and Anumabb

* TRIAL;

if Anumabb[w-1] = 15 and Anumabb

if Anumabb[w-1] = 16 and Anumabb

if Anumabb[w-1] = 17 and Anumabb

if Anumabb[w-1] = 18 and Anumabb

* FREE;

if Anumabb[w-1] = 19 and Anumabb

if Anumabb[w-1] = 20 and Anumabb

* MIDLERTIDIGE LUKNINGER;

if Anumabb[w-1] = 21 and Anumabb

******************************************* KONVERTERING FRA NORMAL ****************************************;

* NORMAL TIL NORMAL;

if Anumabb[w-1] = 5 and 5 < Anumabb

* MIDLERTIDIGE LUKNINGER;

if 0 < Anumabb[w-1] < 9 and Anumabb

******************************************* KONVERTERING FRA KAMPAGNER ****************************************;

* KAMPAGNE TIL NORMAL;

if Anumabb[w-1] = 9 and 0 < Anumabb

if Anumabb[w-1] = 10 and 0 < Anumabb

if Anumabb[w-1] = 11 and 0 < Anumabb

if Anumabb[w-1] = 12 and 0 < Anumabb

if Anumabb[w-1] = 13 and 0 < Anumabb

if Anumabb[w-1] = 14 and 0 < Anumabb

* TRIAL TIL NORMAL;

if Anumabb[w-1] = 15 and 0 < Anumabb

if Anumabb[w-1] = 16 and 0 < Anumabb

if Anumabb[w-1] = 17 and 0 < Anumabb

if Anumabb[w-1] = 18 and 0 < Anumabb

* MIDLERTIDIGE LUKNINGER;

if 8 < Anumabb[w-1] < 15 and Anumabb

if 14 < Anumabb[w-1] < 19 and Anumabb

******************************************* KONVERTERING TIL KAMPAGNE ****************************************;

* HVERVE;

if Anumabb[w-1] = 19 and Anumabb

if Anumabb[w-1] = 20 and Anumabb

if Anumabb[w-1] = 19 and Anumabb

if Anumabb[w-1] = 20 and Anumabb

if Anumabb[w-1] = 19 and Anumabb

if Anumabb[w-1] = 20 and Anumabb

* 4 UGERS;

if Anumabb[w-1] = 11 and 8 < Anumabb

* 10 UGERS;

if Anumabb[w-1] = 16 and Anumabb

if Anumabb[w-1] = 12 and Anumabb

end;

drop w;

run;quit;

Im not really looking for code to tell me how or where the subscribtions are at different times. I need some help to create some code, that can:

- Conditioned on one time period (month) how many opened, and X weeks later, how many of them are still active and how are they distributed over different subscribtions.

Basicly the same questions as in my original post.

Hope this clears things up a bit..

T

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-26-2012 11:09 AM

Toby,

Yes, that helps clear things up. Now let's clear them up the rest of the way, I'll try to construct an example of what you are looking for. If you comment about my assumptions and questions below, there are many posters here who would finish the job (myself included).

Starting with the data set created above, a macro would specify ...

One time period. By taking on a number from 1 to 16, this would specify which variable (numabb1 - numabb16) should be used.

The number of weeks later. In combination with the specified time period, this would tell you which variable (change2 - change16) should be used. For example, if the time period is 5, and the number of weeks later is 4, use change9.

The final result in this case would be a table showing the distribution of change9. But does the time period variable get used in other ways? Added to the title? Other than examining change9, are there other ways to determine which subscriptions are still active? Should the final table utilize just observations and omit others?

Understanding the problem is more than half the battle in this case.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-26-2012 11:14 AM

I would reshape the data to long and use a simple data step to extract relevant observations before tabulating. The useful coding pattern in the data step is the "Double-DoW" (search sas-l like this for some examples). Below I am assuming that you have imported your excel sheet into a dataset, work.one:

/* reshape to long */

**data** two;

length id week numabb **8** change $40.;

set one;

change1 = repeat(" ", **40**-**1**);

array num[**1**:**16**] numabb1-numabb16;

array chg[**1**:**16**] $ change1-change16;

do week = **1** to **16**;

numabb = num[week];

change = chg[week];

output;

keep id week numabb change;

end;

**run**;

/* what happened to those with four week trials opened in weeks 1 to 4

after 8 (or upto the available) weeks later?

if there are two or more trial_4w^s opened in the period, then

we take the latest one. See id=7098 */

**data** three;

/* double DoW */

opened = **0**;

do until (last.id);

set two;

by id;

if (**1**<=week<=**4** and change="OPENED to TRIAL_4W") then

opened = week;

end;

do until (last.id);

set two;

by id;

if opened>**0** and week = opened + min(opened+**8**, **16**) then

output;

end;

**run**;

**proc** **freq** data=three;

tables change/list missing;

**run**;

/* on lst

The FREQ Procedure

Cumulative Cumulative

change Frequency Percent Frequency Percent

------------------------------------------------------------------------

228 86.36 228 86.36

CAMPAIGN_10W to CAMPAIGN_10W 1 0.38 229 86.74

CAMPAIGN_13W to CAMPAIGN_13W 1 0.38 230 87.12

CLOSED from FREE_05 1 0.38 231 87.50

CLOSED from TRIAL_13W 1 0.38 232 87.88

FREE_05 to FREE_05 1 0.38 233 88.26

MIDL to MIDL 1 0.38 234 88.64

NORMAL_13W to NORMAL_13W 20 7.58 254 96.21

NORMAL_52W to NORMAL_52W 2 0.76 256 96.97

NORMAL_KVARTAL to NORMAL_KVARTAL 3 1.14 259 98.11

OPENED to CAMPAIGN_13W 1 0.38 260 98.48

OPENED to NORMAL_13W 1 0.38 261 98.86

TRIAL_13W to TRIAL_13W 2 0.76 263 99.62

TRIAL_26W to TRIAL_26W 1 0.38 264 100.00

*/

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-26-2012 01:48 PM

hi ... I used your CSV file and hopefully I understand the counts that you want ...

*** make a data set;**

**proc import datafile='z:\sampledata.csv' out=xxx;****delimiter=';';****run;**

*** get rid of all those formats/informats added by IMPORT;**

**proc datasets lib=work nolist;****modify xxx;****format _all_;****informat _all_;****quit;**

*** find START in January (change2-change4) ... if START in January, find STOP;**

**data jan;****set xxx;****array change(2:16) change2-change16;**

*** did they start in January;****start = whichc('OPENED to TRIAL_4W',change2,change3,change4) + 1;****if start gt 1;**

*** when did they stop;****stop = whichc('CLOSED from TRIAL_4W',of change(*)) + 1;****run;**

output attached ... within each January week (START 2, 3, 4) ... if STOP = 1, no CLOSED was found in the CHANGE variables

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to MikeZdeb

04-27-2012 05:05 AM

@ Astounding

* The timeperiod variable, week or month, is when the subscribtion is started. But since change2 describes the change between week one, and week two, it's really not nessesary.

* It dosen't get added to the title, that's not really nessesary.

* A active subscribtion is any cell that isen't blank, or has a "CLOSED from X" mark.

* Optimally, something like the result that **chang_y_chung@hotmail.com** gets, but for all combinations and running from month to month (alternativliv week to week). I'll need to look at the Double-DoW and how the reshapeing to long looks. But the output looks promising.

Basicly i need to find out how the flow in subscriptions is. So if X amount of people open to a 10 week campaign, how many of them open to a 13 week normal subscription anytime later.Saying that, having a fixed number of weeks that i need to look ahead (like from change4 to change9) is not a satisfactory solution.

I might have a bunch of TRIAL_4W that opens in change2, then distributed over mabye 6-7 weeks, they will open to other subscribtions. I need to figure out a way to calculate the average number of weeks they are closed (blank) before they open to a new subscribtion, and then what subscribtion they open to. And then again, how many weeks they, on average, hold that subscribtion.

@ chang_y_chung - Thanks a bunch! I think i understand the transformation to long. But i'll have to look at the Double-DoW method. The output you got is near perfect. But i still need to have a "dynamic" timerange, so that is isen't locked on to 4 weeks necessarily.

@ MikeZdeb - Thanks for the input! Im not really sure that i fully understand your output? Im not really interested in how many subscribtions close, but rather how many of the ones who opened, are still active X weeks later.. I hope that my comments above clears up my point abit more.

@ Ksharp - Just saw your post. I need to read some more about the _a and dim(_a)-8, as i'm not really sure what that does. Again, as commented in this post, i would like to have a "dynamic" timelimit, so i don't have to decide on a fixed number of weeks, but rather "Who opened in week1, how are they distributed in the weeks that i have info about now (weeks 1 to 16), how long were they blank before they opened, and what did they open to?

Thanks alot for the input, it's really helpful. Finding help to SAS datastep on google, is a bit difficult when im not even sure where to start.

T

Solution

04-29-2012
09:38 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-29-2012 09:38 AM

T,

I think I have the right idea here, and that it will be easy enough for you to adjust if not. This approach saves all the formatting for later on, but finds the initial promotional status, and the initial subsequent subscription status. It works with your final data set that already has transposed the data, but just as easily could be redone to work with the earlier format to the data with multiple records per ID.

You may need to adjust some of the ranges I used, like 16 to 20 = trial subscription. But you're most familiar with what the codes mean and which should be grouped.

data want;

set have;

array anumabb {16} numabb1 - numabb16;

do week=1 to 16 until (week_started_trial > .);

if (16 <= anumabb{week} <= 20) then do;

week_started_trial = week;

starting_trial_status = anumabb{week};

end;

end;

* May want to set a default, if there was never a trial;

if (0 < week_started_trial < 16) then

do week=week_started_trial+1 to 16 until (week_started_subscription > .);

if (1 <= anumabb{week} <= 15) then do;

week_started_subscription = week;

starting_subscription_status = anumabb{week};

end;

end;

keep id week_started_trial week_started_subscription starting_trial_status starting_subscription_status;

run;

There are lots of ways to format a table from this point. First, let's see if this is pulling out the information you need (or at least most of it).

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Astounding

05-03-2012 04:50 AM

@ Astounding

That works really well... Thanks alot! I completely understand the code, and from here i can easily reshape and format to the output that i desire.

It pulls out the desired startperiod of a subscribtion, and then tells me how many weeks later the other subscribtion starts. Its perfect - and i think that i can reformat it, so when they go from subscribtion two to three, i can follow.

Again, thanks!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-03-2012 07:03 AM

then please mark his code as correct answer. Thank you!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to TMorville

04-27-2012 05:01 AM

I think you can use ARRAY to solve ' one time period (month) how many opened, and X weeks later, how many of them are still active'.

The question followed it will be easy.

proc import datafile='c:\sampledata.xls' out=have dbms=excel replace; getnames=yes;run; data Open ; set have; array _a{*} $ change: ; month=0; do i=1 to dim(_a)-8; month+1; if _a{i} ="OPENED to TRIAL_4W" and not missing(_a{i+8}) then output; end; run;

Ksharp