Solved: Re: Best way to "transpose"/make wide this data

UniversitySas · Posted 09-20-2018 11:46 PM

So, what I'm asking is not exactly a transpose (from my knowledge) and I'm finding it very tricky! Any help will be immensely appreciated.

I have some sample data:

DATA Unclean ;
 
INFILE DATALINES ;
 input ID VAR & $19. CLASSIFICATION $ ;
 DATALINES;
201  PepsiCo  XYZ  
201  111 main street  XYZ  
201  Charity  XYZ  
331  CocaCola  XYZ  
331  1823 unicorn street  XYZ  
331  Charity  XYZ  
331  Nike  XYZ  
331  123 brock avenue  XYZ  
331  Charity  XYZ  
331  Adidas  XYZ  
331  14 Market street  XYZ  
331  Charity  XYZ  
354  McDonalds  XYZ  
354  111 Obesity road  XYZ  
354  Charity  XYZ  
;
RUN;

And my goal would be to get an output like this:

DATA CLEAN;
INFILE DATALINES ;
 input ID Name & $10. Address & $19. Classification $ ;
 DATALINES;
201  PepsiCo  111 main street	Charity 
331  CocaCola  1823 unicorn street  Charity   
331  Nike  123 brock avenue  Charity  
331  Adidas  14 Market street  Charity  
354  McDonalds  111 Obesity road  Charity  
;
RUN;

Specifically, there are a few challenges I am having.

1) The IDs have uneven companies attached to them

2) The value "XYZ" needs to be changed to reflect things like Name, Address, and Classification.
Regarding 2) My fix has been that whenever we see an observation "Charity", the next observation is always a "Name".

Is there another way, more flexible way to approach this?

3) What would be the best function or command to achieve this?

Any help is very much appreciated!

KachiM · Posted 09-21-2018 06:36 AM

You should have placed a SAMPLE Data Set that represent your real Data SET. I have given a solution based on your sample Data Set.

Now you say there will be varying number of records - 2 records instead of 3. I hope you don't come with newer issues with your original Data Set.

The solution is based on first creating triplets with adding "NONE" when "Charity" is missing. Next the same program without using LAST.ID. We can simply check for either "NONE" or "Charity" to output a record at a time.

data source;
INFILE DATALINES ;
 input ID VAR & $19. CLASSIFICATION $ ;
 DATALINES;
201  PepsiCo  XYZ  
201  111 main street  XYZ  
201  Charity  XYZ  
331  CocaCola  XYZ  
331  1823 unicorn street  XYZ  
331  Charity  XYZ  
331  Nike  XYZ  
331  123 brock avenue  XYZ  
331  Charity  XYZ  
331  Adidas  XYZ  
331  14 Market street  XYZ  
331  Charity  XYZ  
354  McDonalds  XYZ  
354  111 Obesity road  XYZ  
354  Charity  XYZ  
354  KFC  XYZ  
354 111 Fat Avenue  XYZ  
441  Burger King  XYZ  
441  112 Skinny ROAD  XYZ
441  Adidas  XYZ
441  Socks Blvd  XYZ
441  Tiege Henley  XYZ
441  Skincare Road  XYZ
441  Charity  XYZ
;
RUN;
proc print data = source;
run;

data have;
   set source;
   if Var = "Charity" then i = 0;
   else i+1;
   if i = 3 and Var ^= "Charity" then do; 
      temp = Var; 
      Var = "NONE";  
      output;
      Var = temp;
      output;
      temp =' ';
      i = 1;
   end;
   else output;
drop temp i;
run;

proc print data = have;
run;

The Data Set, HAVE, will have triplets either with "NONE" or "Charity".

The next the old program without LAST.ID checks.

data clean;
length ID 8 Name $10 Address $19 Classification $8 Var $19; 
   do i = 1 by 1 until(VAR in ("NONE","Charity"));
      set have;
      select(mod(i,3));
         when(1) Name=Var;
         when(2) Address = Var;
         when(0) Classification = Var;
         otherwise;
      end;
   end;
drop i Var;
run;
proc print data = clean;
run;

View solution in original post

Reeza · Posted 09-20-2018 11:49 PM

Was the original format JSON or XML?
And didn’t that have the tags to create the classification variables?

UniversitySas · Posted 09-20-2018 11:52 PM

The data I received was a sas file actually. It's extremely granular data, and from what I've been told the original data file wouldn't have what you're asking

Reeza · Posted 09-21-2018 12:02 AM

If it’s always this style with three lines it’s actually trivial. If you can have varying lines per company then it’s problematic.

Use MOD() to count lines.

Data have;
Set source;
X=mod(_n_, 3);
If x=1 then class=‘name’;
Else if class =2 then class = ‘address’;
Else if class=0 then class = ‘type’;

Retain record;
If x=1 record+1;
Run;

Proc transpose data=have out=want;
By Id record;
Var var;
Id class;
Run;

Untested.

VDD · Posted 09-20-2018 11:49 PM

this link should assist you with understanding how to transpose data:

https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-long-to-wide-using-proc-transpose/

It looks like the x = row 1, y = row 2, z = row 3

output your record.

start you count over for x,y.z;

basically looks like 3 records per final record.

UniversitySas · Posted 09-20-2018 11:59 PM

I've had a look at that page, I couldn't really get anything to work.

Are you insinuating that I need to use a LOOP here?

VDD · Posted 09-21-2018 12:01 AM

yes you may need to hold record 1 and record 2 and output after obtaining record 3

VDD · Posted 09-21-2018 12:07 AM

proc transpose data=have out=want(drop=_name_);
   by id;
   var var;
run;

you will need to pass the var-names to the columns

UniversitySas · Posted 09-21-2018 12:15 AM

For now the adding the correct var names are not too important, so I can address that later.

But using a PROC TRANSPOSE gives me something like this:

It fails to re-iterate values for ID 331, instead just lumping them into the same row.

VDD · Posted 09-21-2018 12:25 AM

you need to clean your data as @Reeza has noted above first. Then use the proc transpose.

VDD · Posted 09-21-2018 12:39 AM

with cleaned up data based on @Reeza datastep

data source;
INFILE DATALINES ;
 input ID VAR & $19. CLASSIFICATION $ ;
 DATALINES;
201  PepsiCo  XYZ  
201  111 main street  XYZ  
201  Charity  XYZ  
331  CocaCola  XYZ  
331  1823 unicorn street  XYZ  
331  Charity  XYZ  
331  Nike  XYZ  
331  123 brock avenue  XYZ  
331  Charity  XYZ  
331  Adidas  XYZ  
331  14 Market street  XYZ  
331  Charity  XYZ  
354  McDonalds  XYZ  
354  111 Obesity road  XYZ  
354  Charity  XYZ  
;
RUN;


Data have;
length class $7;
Set source;
X=mod(_n_, 3);
If x=1 then class="name";
Else if x =2 then class = "address";
Else if x=0 then class = "type";

Retain record;
If x=1 then record+1;
Run;

Proc transpose data=have(drop=record _name_) out=want;
By Id record;
Var var;
Id class;
Run;

UniversitySas · Posted 09-21-2018 01:41 AM

This is perfect, thank you so much!

KachiM · Posted 09-21-2018 01:37 AM

I suppose you don't insist on PROC TRANSPOSE. In one Data Step you will get the required output. Assumed that the Data Set is sorted by ID.

The sequential records provide one of the three Variables where LASTID can be used(except for ID = 331). For ID = 331, the alternate check with "Charity" breaks them into several records. The use of MOD(i,3) picks the relevant Value for Variables.

data clean;
length ID 8 Name $10 Address $19 Classification $8 Var $19; 
   do i = 1 by 1 until(last.id | Var = 'Charity');
      set unclean;
      by id notsorted;
  
      select(mod(i,3));
         when(1) Name=Var;
         when(2) Address = Var;
         when(0) Classification = Var;
         otherwise;
      end;
   end;
drop i Var;
run;

UniversitySas · Posted 09-21-2018 01:41 AM

This looks fantastic - still very new to SAS so i'm very keen on working through this code you posted - it looks extremely compact!

UniversitySas · Posted 09-21-2018 03:42 AM

Hey datasp,

So one thing I just realised was that the code is not quite flexible enough; I was wondering if you had a fix.

Some of my data is organised in a way that the first 1/4 might have the recurring pattern for 9 rows, the 2nd quarter for 8 rows, the third quarter for 9 rows again and the final quarter for 7 rows.

I understand it is a little vague since I don't have a sample of the dataset here (which I can provide if it's needed), but do you have an alternate strategy for the case above?

Would it my best bet be to break the data into separate pieces and then merge it again later, or is there another way?

SAS Innovate 2025: Call for Content

Classroom Training Available!