BookmarkSubscribeRSS Feed
Jannie_D
Calcite | Level 5

I am working on a secure server (which means I can not copy paste anything from it) with several large datasets that I have merged into one. for the current study, I need to use an ID number called pnr, a data and price from four different databases and a population number from 1-6. The aim is to create a total price for each ID and do statistical analyses on that. BUT I have encountered a problem. I have merged my datasets by the ID number and now each Id number has more that 20 lines each. (in a dataset with more than 18 000 ID numbers it makes it impossible to overview). I have tried many different ways of making it broad instead  of long so that I can make a total variable but I have not succeeded. Hereunder I will try to go through the steps i have tried and done:

data all;

merge LMDB DRG SSSY DTSDCCG UDDA;

by pnr;

run;

 

that creates a very long dataset with several million observations.

 

Then I tried to do a proc sort step like this

PROC SORT DATA=all
DUPOUT=allnodup
NODUPRECS ;
BY pnr ;
RUN ;  

 

proc print allnodup;

run;

 

which ended up running for more than a day using a lot of capacity unsuccessfully.

 

then I tried this

 

proc sql ;
create table allnodup as
select DISTINCT (pnr),
price1,
date1,
price2,
date2,
price 3,

date3
from all
order by pnr ;
quit ;

 

which did eliminate some 

so then I did continue with with a 

 

proc transpose data=allnodup out=allw prefix=price1;

by pnr;

id date1;

run;

proc print data=allw;

run;

 

this gave we a lot errors in the log saying the "Id value...... occures twice"  and finaly "to many bad ID values"

I even tried to make to categories manually in a data step, defining payments/month but it seems as if there are seceral that have more price values every month.

Hereunder I tried to create an example of how it looks:

data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyys10. price2:8. date2:mmddyys10. price3:8. date3:mmddyys10. population:8.
;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;
title "All";
proc print;
run;


Obs

PNR

Price1

date1

price2

date2

price3

date3

population

1

887766

1500

1012014

45000

1012014

55000

1052018

1

2

887766

1703

3042015

45000

1012014

55000

1052018

1

3

887766

1703

3042015

45000

2012014

55000

1052018

1

4

887766

1703

3042015

45000

1012014

55000

1052018

1

5

887766

1703

3042015

45000

1012014

5701

1052018

1

6

887755

3200

4022018

245761

3042017

7001

2032016

2

7

887755

8700

4022018

245761

3042017

7001

2032016

2

8

887755

0

4022018

245761

3042017

7001

2032016

2

9

887755

3300

4022018

245761

3042017

7001

2032016

2

10

887755

3200

4022018

241

3052017

7001

2032016

2

11

887755

3200

4022018

245761

3042017

0

2032017

2

12

887744

9763

1012014

99838

1032018

5056

10252018

5

13

887744

9700

1012015

99838

1032018

5056

10252018

5

14

887744

9763

1012014

99858

1032018

5056

10252018

5

15

887744

9763

1012014

55

1072016

5056

10252018

5

16

887733

3387

7162015

45000

2032016

30050

12012014

3

17

887733

3377

7162015

45000

2032016

30050

12012014

3

18

887733

3377

7162015

789

2032017

30050

12012014

3

19

887733

3377

7162015

45000

2032016

30050

12012014

3

20

887733

3377

7162015

1700

12052016

30050

12012014

3

21

887733

3377

7162015

45000

2032016

308

12011014

3

22

887733

1500

7162015

45000

2032016

30050

12012014

3

23

887733

1500

7192015

45000

2032016

30050

12012014

3

24

887722

4352

7152017

0

3042018

47531

11222014

6

25

887722

4372

7162017

0

3042018

47531

11222014

6

26

887722

5678

7152017

0

3042018

47531

11222014

6

27

887722

4352

7152017

750

3052018

47531

11222014

6

28

887722

4352

7152017

0

3042018

4700

11252014

6

 

pnr: represents a social security number for a person but made with a secret key

costs are all in DKK

dates are all in date format except the one I did not enter hereover that is month and year format, the population is a group defined elsewhere there are 6 different groups.

 

So my aim is to create a total cost per person first per year and then overall and then compare between the populations.

 

So I am now stranded in a situation where I am not sure what I can do?

Maybe someone here can help me. 

18 REPLIES 18
Kurt_Bremser
Super User

Please make up an example of your dataset ALL that illustrates your issue, and post it as a data step with datalines.

Right now it is hard to help you, because we have no idea what your data looks like.

Jannie_D
Calcite | Level 5
I am not completly sure I understand what you mean?
PaigeMiller
Diamond | Level 26

Make up a data set that accurately represents your problem. Show it to us as SAS data step code, or via these instructions: https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/.

--
Paige Miller
Jannie_D
Calcite | Level 5
I thought that you wanted a look at how the data looked. Our data commes from big national databases and we did not make our dataset ourselves. our task is to clean and analyse it (and as a phd student with 6 months SAS experience it can be difficult when the amounts of data are this huge.
My suspicion is that there are more than one payment in one or more of the price categories and the same date. But I dont know how to get around that.

Herunder is the code I used to build the tanel above:
data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyys10. price2:8. date2:mmddyys10. price3:8. date3:mmddyys10. population:8.
;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887733,3377,07162015,45000,02031016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
;;;;
title "All";
proc print;
run;
Tom
Super User Tom
Super User

Thanks. This is closer to a description of your data.  You should supply more information about what the variables mean. You should supply more information about what you are actually trying to do.  Your original request made it sound like you just want to transpose from "tall" to "wide", but this data looks like it is already wide.  You mention cleaning.  What is it that you want to check?  Are you having issues with invalid values (strings that cannot be read as dates for example?)  Or some other more complex cross record checks, like end date before start date?

 

PS When using this forum to post code use the buttons above the edit/text box to open a pop-up window to paste/type the code.  Use the same button to re-open the pop-up window to make edits to the code.  The SAS code button looks like this:

Tom_0-1620137598596.png

 

 

 

Jannie_D
Calcite | Level 5

So I tried to make an example 

 

Obs

PNR

Price1

date1

price2

date2

price3

date3

population

1

887766

1500

1012014

45000

1012014

55000

1052018

1

2

887755

3200

4022018

245761

3042017

7001

2032016

2

3

887744

9763

1012014

99838

1032018

5056

10252018

5

4

887733

3377

7162015

45000

2031016

30050

12012014

3

5

887722

4352

7152017

0

3042018

47531

11222014

6

 except theres is  another columne of price and date and that date is given in month and year. and then yes there are another 18000 rows in the original. does it help you helping me? 

PaigeMiller
Diamond | Level 26

Repeating the earlier request: "Show it to us as SAS data step code, or via these instructions: https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/."

 

Are you showing us the long data set, or the broad data set?

--
Paige Miller
Tom
Super User Tom
Super User

That looks like the output of PROC PRINT, perhaps copied from ODS output instead of normal listing output?

To share the data it would be better to dump it to the log using DSD option.

data _null_;
  set have (obs=5);
  file log dsd ;
  put (_all_) (+0);
run;

Then you could copy and paste those comma separated values into in-line data (CARDS aka DATALINES) and write an input statement to read them back into SAS variables.  Then people trying to help can just copy the code into SAS and create a copy of your sample data.

data have;
  infile cards dsd truncover ;
  input PNR Price1 date1 price2 date2 price3 date3 population;
cards;
887766,1500,1012014,45000,1012014,55000,1052018,1
887755,3200,4022018,245761,3042017,7001,2032016,2
887744,9763,1012014,99838,1032018,5056,10252018,5
887733,3377,7162015,45000,2031016,30050,12012014,3
887722,4352,7152017,0,3042018,47531,11222014,6
;

Do you really store the dates as numbers like the 1,012,014 value of DATE2 that you have in that report?

If so why are you not storing them as dates?

 

 

 

Jannie_D
Calcite | Level 5

I have revised my original post. I hope it all makes better sense now

Kurt_Bremser
Super User

Correct your code like this:

data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10. price3:8. date3:mmddyy10. population:8.
;
format date: yymmdd10.;
datalines4;

so that the dates are read as SAS dates.

Now, what output do you expect for ID 887766, just to have one example?

Jannie_D
Calcite | Level 5
Thank you for the help with formatting the date.
each pnr-number should end up having a total cost per year and a total allover cost. As I see it I need to make the dataset broad to do that. but somehow I am not succeeding.
så that I for each person can create ie. cost12014, cost22014 etc and totalcost2014 and then for the other years and finally for all years and all cost and then compare between populations. does that make sense?

to sum the costs I do need them on the same line right?
Kurt_Bremser
Super User

Instead of making a dataset broad(er), you always strive for making it narrow, or long instead of wide (see Maxim 19).

This reduces the number of columns and therefore the coding effort needed.

So you run this data steps:

data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10. price3:8. date3:mmddyy10. population:8.
;
format date: yymmdd10.;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;

data want;
set all;
by pnr notsorted;
array years {1900:2050} _temporary_;
array dates {*} date:;
array prices {*} price:;
keep pnr year total_cost;
if first.pnr then call missing(of years{*});
do i = 1 to dim(dates);
  if dates{i} ne . then years{year(dates{i})} + prices{i};
end;
if last.pnr then do year = 1900 to 2050;
  if years{year} ne .
  then do;
    total_cost = years{year};
    output;
  end;
end;
run;

which, without any further sorting, gives you just one observation per PNR and year.

From which you can easily run a report like this:

proc report data=want;
column pnr year total_cost;
define pnr / group;
define year / display;
define total_cost / analysis sum;
break after pnr / summarize;
run;

 

Jannie_D
Calcite | Level 5

I am very sorry but I think I need a bit further explanaition to understand what I am doing and to fit it with my real dataset.

so the first 3 lines makes sense. except that my file is sorted with a noduprec statement.

line 4: years (does that define the variable or array or it my variables covering years? and why temporary?

well a bit the same with next two lines: is it dates/prices or date/price (after the asterix) that is avriable?

I am skiping the keep statement as there are a few other variables that I need for subanalysis, like gender, bmi etc.

the next line I do not understand. what does it do? actualy I am not sure what any of the rest does. 

 

I can not make it work in my original dataset, probably because I do not understand the code 

 

 

data want;
set all;
by pnr notsorted;
array years {1900:2050} _temporary_;
array dates {*} date:;
array prices {*} price:;
keep pnr year total_cost;
if first.pnr then call missing(of years{*});
do i = 1 to dim(dates);
  if dates{i} ne . then years{year(dates{i})} + prices{i};
end;
if last.pnr then do year = 1900 to 2050;
  if years{year} ne .
  then do;
    total_cost = years{year};
    output;
  end;
end;
run;

 

 

which, without any further sorting, gives you just one observation per PNR and year.

From which you can easily run a report like this:

proc report data=want;
column pnr year total_cost;
define pnr / group;
define year / display;
define total_cost / analysis sum;
break after pnr / summarize;
run;

 and why temporary?

 

 

Kurt_Bremser
Super User

With your data as posted, I could also have used the DESCENDING option in the BY statement; I just wanted to be sure it works.

Line 4 creates an array as a pure memory structure (because of _TEMPORARY_) that covers all possible years with a numeric item for each. The array covers all years that are somehow probable (you can expand or shrink it as makes sense in your context).

_TEMPORARY_ has several effects:

  • all the items have the same size, so addressing a single item is _very_ fast
  • the individual items have no names (as in other arrays that consist of variables in the PDV)
  • nothing of it can appear in the output (b/c no names)
  • the array is automatically retained

The next two lines define arrays for the date and price variables in the incoming dataset. It is assumed that you will have an equal number of those.

The KEEP statement solves two purposes: it keeps only the variables needed (add yours here), and it helps define the order of the variable in the output (note that this is the first place in the code where year and total_cost appear).

At the beginning of a group (first.pnr) I clear the array, and then I scan all the dates/prices in every observation and add them up in their respective years. See the documentation for the use of OF in data step functions here.

Next, in every observation, I scan through the dates and add the matching price to the respective year sum (year derived from the date), if the date is not missing.

When a group finishes (last.pnr), I iterate through the array and write an observation for every year that has a non-missing value.

As a result, you get a very neat dataset, keeping only the necessary variables and observations (no space wasted for missing values etc).

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 1467 views
  • 0 likes
  • 4 in conversation