Solved: Large Dataset Long to Wide

OlsabeckT29 · Posted 06-28-2022 11:31 AM

Hello, I am working with a long dataset of about 15 million records that currently is listing out the vaccine details for each shot a person received. I need to get this dataset into a wide dataset. I have 9 variables that need to be transposed and would rather not do 9 proc transpose steps with this large of a dataset. I have recip_id as my by variable and Dose1 is what I need to transpose on. Dose1 ranges from 1-13.

Is there a more efficient way to get what I need? I am running SAS 9.4

Below is an example of what I have vs what I want:

data have;

input recip_id Dose1 admin_date CVX MC Auth Lot_id dose_num vax_fund admin_state interval

Datalines;

1 1 10/16/2021 208 PF US acvds 1 k MI .

1 2 11/6/2021 208 PF US bdsa 2 k MI 21

1 3 5/29/2021 207 MO US bde1de 3 k MI 90

2 1 6/7/2021 207 MO US 2ds42 1 h OH .

3 1 7/7/2021 208 PF US 3fs1 1 h WI .

3 2 7/7/2021 208 PF US fews2 2 k WI 28

3 3 8/4/2021 208 PF US gesw21 3 k WI 87

4 1 1/21/2022 208 PF US fewq21 1 h NY .

4 2 12/22/2020 208 PF US gresa21 2 h NY 28

5 1 1/12/2021 212 JS US gewa2 1 k WV .

6 1 4/12/2021 207 MO US fewqa2 1 k IN .

6 2 4/29/2021 207 MO US gewad2 2 k IN 27

Want

recip_id Dose1 admin_date1 CVX1 MC1 Auth Lot_id1 dose_num1 vax_fund1 admin_state1 interval1 admin_date2 CVX2 MC2 Auth2 Lot_id2 dose_num2 vax_fund2 admin_state2 interval2 admin_date3 CVX3 MC3 Auth3 Lot_id3 dose_num3 vax_fund3 admin_state3 interval3

1 1 10/16/2021 208 PF US acvds 1 k MI . 11/6/2021 208 PF US bdsa 2 k MI 21 5/29/2021 207 MO US bde1de 3 k MI 90

2 1 6/7/2021 207 MO US 2ds42 1 h OH .

3 1 7/7/2021 208 PF US 3fs1 1 h WI . 7/7/2021 208 PF US fews2 2 k WI 28 8/4/2021 208 PF US gesw21 3 k WI 87

4 1 1/21/2022 208 PF US fewq21 1 h NY . 12/22/2020 208 PF US gresa21 2 h NY 28

5 1 1/12/2021 212 JS US gewa2 1 k WV .

6 1 4/12/2021 207 MO US fewqa2 1 k IN . 4/29/2021 207 MO US gewad2 2 k IN 27

Tom · Posted 06-28-2022 01:23 PM

Why idoesn't the dashboard just read the data directly from the source database? Why copy it into a text file?

What type of text file? A delimited file, like a CSV file?

How robust is the tool that reads the "text" file? Could it handle having some lines with fewer values than others?

If so then just generate the text file directly with a data step.

data _null_;
  set have;
  by id;
  file out dsd ;
  if first.id then put irecip_id Dose1 @;
  put admin_date CVX MC Auth Lot_id dose_num vax_fund admin_state interval @;
  if last.id then put;
run;

And if you need add the extra commas at the ends of the lines then just make the code a little more complex.

For example if there should be 10 replicates of those 9 variables then you could use something like this to write the extra commas to ends of the lines for the ids that had fewer than ten.

  if first.id then put irecip_id Dose1 @;
  put admin_date CVX MC Auth Lot_id dose_num vax_fund admin_state interval @;
  dummy=' ';
  if last.id then do;
    do i=1 to 9*(10-dose1); put dummy @; end
    put;
  end;

You could even make it smart enough to count the number of observations if that DOSE1 variable is not actually a count of the number of observations per ID.

View solution in original post

HB · Posted 06-28-2022 11:50 AM

May I ask why you would want to do that? You are denormalizing the data- things like state, state2, state3, state4, etc. are bad from a database design perspective.
What is the goal? To display vaccine history by personID? Because you can do lots of things to present the data without messing up the data structure.

ballardw · Posted 06-28-2022 12:02 PM

Do you actually need a data set? What will you do with the data set?

If you are expecting a human to read the result that would be a Report and making a report would likely be the approach.

Did you test your data step before providing it to us? It throws many 'invalid data' errors. Your input statement (besides not ending in a semicolon ) would attempt to make all of the variables numeric. Most of your variables are apparently not numeric plus there appears to be an issue with spacing, possibly caused by pasting tab delimited data. Also the datalines do not end in a semicolon either.

OlsabeckT29 · Posted 06-28-2022 12:45 PM

The dataset I have in sas is read through an oracle connection and has millions of observations. I just type it into the question to provide an example of what I am working with.

This dataset is used to generate a dashboard in Power BI so the end result is an exported txt file. The dashboard shows vaccine coverage rates. It needs to be a person level dataset, not a shot level dataset to do this.

Tom · Posted 06-28-2022 01:23 PM

Why idoesn't the dashboard just read the data directly from the source database? Why copy it into a text file?

What type of text file? A delimited file, like a CSV file?

How robust is the tool that reads the "text" file? Could it handle having some lines with fewer values than others?

If so then just generate the text file directly with a data step.

data _null_;
  set have;
  by id;
  file out dsd ;
  if first.id then put irecip_id Dose1 @;
  put admin_date CVX MC Auth Lot_id dose_num vax_fund admin_state interval @;
  if last.id then put;
run;

And if you need add the extra commas at the ends of the lines then just make the code a little more complex.

For example if there should be 10 replicates of those 9 variables then you could use something like this to write the extra commas to ends of the lines for the ids that had fewer than ten.

  if first.id then put irecip_id Dose1 @;
  put admin_date CVX MC Auth Lot_id dose_num vax_fund admin_state interval @;
  dummy=' ';
  if last.id then do;
    do i=1 to 9*(10-dose1); put dummy @; end
    put;
  end;

You could even make it smart enough to count the number of observations if that DOSE1 variable is not actually a count of the number of observations per ID.

pink_poodle · Posted 06-28-2022 12:36 PM

You would need to transpose long to wide using arrays. Here is a post about it:

https://communities.sas.com/t5/SAS-Programming/Long-to-wide-Array/td-p/158600

Tom · Posted 06-28-2022 12:39 PM

https://support.sas.com/resources/papers/proceedings10/102-2010.pdf

Ksharp · Posted 06-29-2022 12:23 AM

 

data have;
input recip_id Dose1 (admin_date CVX MC Auth Lot_id dose_num vax_fund admin_state interval) (:$40.);
Datalines;
1 1 10/16/2021 208 PF US acvds 1 k MI .
1 2 11/6/2021 208 PF US bdsa 2 k MI  21
1 3 5/29/2021 207 MO US bde1de 3 k MI 90
2 1 6/7/2021 207 MO US 2ds42 1 h OH . 
3 1 7/7/2021 208 PF US 3fs1 1 h WI .
3 2 7/7/2021 208 PF US fews2 2 k WI 28
3 3 8/4/2021 208 PF US gesw21 3 k WI 87
4 1 1/21/2022 208 PF US fewq21 1 h NY .
4 2 12/22/2020 208 PF US gresa21 2 h NY 28
5 1 1/12/2021 212 JS US gewa2 1 k WV .
6 1 4/12/2021 207 MO US fewqa2 1 k IN .
6 2 4/29/2021 207 MO US gewad2 2 k IN 27
;
run;

proc sql noprint;
select distinct catt('have(where=(Dose1=',Dose1,') rename=(
admin_date=admin_date',Dose1,' CVX=CVX',Dose1,' MC=MC',Dose1,
' Auth=Auth',Dose1,' Lot_id=Lot_id',Lot_id,' dose_num=dose_num',Dose1,
' vax_fund=vax_fund',Dose1,' admin_state=admin_state',Dose1,' interval=interval',Dose1,'
))') into :merge separated by ' '
from have;
quit;
data want;
merge &merge.;
by recip_id;
drop Dose1;
run;

Large Dataset Long to Wide

Re: Large Dataset Long to Wide

Re: Large Dataset Long to Wide

Re: Large Dataset Long to Wide

Re: Large Dataset Long to Wide

Re: Large Dataset Long to Wide

Re: Large Dataset Long to Wide

Re: Large Dataset Long to Wide

Re: Large Dataset Long to Wide

Classroom Training Available!