DATA Step, Macro, Functions and more

Data Seperation

Reply
Contributor
Posts: 71

Data Seperation

I have an issue with my data hope u can help me in this. In my data set one variable data is data is populated lik car; tv; washing machine; mobile; cam . I have 15mill records in that data set. For this variable I need to separate each product into new variable.


Data a1;

set a;
Input var $50;
Product="car; tv; fridge; washingmachine; mobile"; output;

Product="Car; fridge; phone"; output;
Run;
This is sample data. Each product is seperated by " ; ".  Next to ; is new product which need to be created in seperate variable


I'm looking for output like


x1 x2 x3 x4 product                                                            product1 product2  product3   product 4                      product5

  1  a   a   a  car; tv; fridge; washingmachine; mobile     car             tv            fridge         washing machine     Mobile        

  2   b   b  b Car; fridge; phone                                           car              Fridge    Phone                                                    



x1 x2 x3 x4 are the other variables which are present in my data set.


Regard,

venky


Super User
Posts: 5,260

Re: Data Seperation

It would help if you know how many products there can be as a maximum.

Could be solved by using an array, do-loop with scan.

Data never sleeps
Contributor
Posts: 71

Re: Data Seperation

Thanks for the reply. The problem here is we don't know how many products are there a person may took finance for single or multiple products.

Super User
Posts: 17,868

Re: Data Seperation

I recommend a different structure, as that will be sparse.  This is also easier to generate and summarize later on. 

id product

1 car

1 tv

1 fridge

1 washing machine

2 car

2 fridge

2 phone

Contributor
Posts: 71

Re: Data Seperation

You are correct but  my data in that column is defined like that. I found of alternative way. the problem is if this variable becomes significant at time of modeling its problem we need to share the logic  that apples on there DB.

Super User
Posts: 5,260

Re: Data Seperation

I agree with https://communities.sas.com/people/Reeza, much better layout for reporting/analysis.

And Beauty of it is that to accomplish that, you don't need to know about how many Products there may be, and you don't have to involve arrays. Just use do - until with explicit output.

This could be a way forward - mainly depending in what kind of data resides in the other variables.

Data never sleeps
Super User
Posts: 17,868

Re: Data Seperation

I'm confused. It sounds like you have the data in one format. Are you uploading it into a database that is restricted to that format?

There is no issue with the Have data set, I have issues with the Want data set.

Contributor SKK
Contributor
Posts: 35

Re: Data Seperation

Hi.. I think this might work for you..

Data have;
infile datalines missover;
input x1 x2 $ x3 $ x4 $ products :$100.;
datalines4;
1  a   a   a  car;tv;fridge;washingmachine;mobile          
2   b   b  b Car;fridge;phone    
;;;;
run;


proc sql noprint;
select put((max(count(products,";"))+1),best.) into :tot from have;
quit;

data want(drop=i);
set have;
array pro(*) $30 product1-product%sysfunc(strip(&tot));
do i=1 to %sysfunc(strip(&tot));
pro(i)=scan(products,i,";");
end;
run;

Frequent Contributor
Posts: 106

Re: Data Seperation

try this,

Data a1;

length product $10000;

Product="car; tv; fridge; washingmachine; mobile"; output;

product="Car; fridge; phone"; output;

Product="car1; tv; fridge; washingmachine; mobile;car; tv; fridge; washingmachine; mobile"; output;

Run;

%macro test;

data a1;set a1;

dlm= countw(product,';');

run;

proc sql noprint;

select max(dlm) into :max_dlm from a1;

quit;

data want;

set a1;

%do i=1 %to &max_dlm.;

product&i=scan(product,&i,';');

%end;

run;

%mend;

%test;

Regular Contributor
Posts: 217

Re: Data Seperation

Hi Pallis,

I agree with pradeepalankar's solution.  Sometimes we cannot change the data source.  Architects build databases for reasons that are best for their purposes.  P's solution uses the data source as it is given to you.  Also, P's solution allows for the use of a scan function to start looking for new products after the position of the fourth semicolon (fifth word)  "scan(productlist,4,';')" if you want to ignore the first four products.

Ask a Question
Discussion stats
  • 9 replies
  • 256 views
  • 0 likes
  • 6 in conversation