DATA Step, Macro, Functions and more

construct my own data sets from two source data sets

Accepted Solution Solved
Reply
Contributor
Posts: 31
Accepted Solution

construct my own data sets from two source data sets

I have two large source data sets. Below code can create example of my data sets (source data sets contain more obs and more years of information).

 data old_1;
input id PC2000 $ PC2001 $ PC2002 $ PC2003 $;
datalines;
1 AAAAA AAAAA BBBBB AAAAA
2 AAAAA BBBBB CCCCC BBBBB
3 BBBBB BBBBB BBBBB BBBBB
4 CCCCC CCCCC AAAAA BBBBB
5 DDDDD AAAAA AAAAA BBBBB
6 AAAAA BBBBB CCCCC DDDDD
7 EEEEE EEEEE DDDDD BBBBB
;

data old_2;
input PC $ region $;
datalines;
AAAAA 1111
BBBBB 1111
CCCCC 2222
DDDDD 2222
EEEEE 3333
;

My goal is to construct a new data set which counts amount of movement flows (including number of people stay and move) between each pair of region for each year of 2001, 2002, 2003. My idea is to merge the two source data first to get something like:

 

id PC2000 Region2000 PC2001 Region2001 PC2002 Region20002 PC2003 Region2003

1  .............................

2 AAAAA        1111        BBBBB      1111         CCCCC       22222         BBBBB     11111

3 ......

.

.

Then use the information to get people movement flows between each pair of region for each year of 2001, 2002, 2003. For example, since no one moved from Region 1111 in 2000 to Region 2222 in 2001 (Origin is 1111, Destination is 2222), the number of movers from Region 1111 to 2222 in 2001 is 0;   since there are 2 people (person id 2 and 6) moved from Region 1111 in 20001 to Region 2222 in 2002, the number of movers from Region 1111 to 2222 in 2001 is 2.  .....   My new data set likes:

 

 data new;
input Origin_Region $ Destination_Region $ year flow;
datalines;
1111 1111 2001 4 
1111 2222 2001 0
1111 3333 2001 0
1111 1111 2002 3
1111 2222 2002 2
1111 3333 2002 0
1111 1111 2003 4
1111 2222 2003 0
1111 3333 2003 0
2222 2222 2001 1
2222 1111 2001 1
2222 3333 2001 0
2222 2222 2002 0
2222 1111 2002 1
2222 3333 2002 0
2222 2222 2003 1
2222 1111 2003 1
2222 3333 2003 0
3333 3333 2001 1
3333 1111 2001 0
3333 2222 2001 0
3333 3333 2002 0
3333 1111 2002 0
3333 2222 2002 1
3333 3333 2003 0
3333 1111 2003 0
3333 2222 2003 0
;

Sorry for the long question! Any help will be great!

Thanks!

 


Accepted Solutions
Solution
‎06-09-2018 07:53 PM
PROC Star
Posts: 8,163

Re: construct my own data sets from two source data sets

I'm sure there is a more direct approach but, since no one has suggested any yet, here is one way to achieve your desired result:

data old_1;
  input id PC2000 $ PC2001 $ PC2002 $ PC2003 $;
  datalines;
1 AAAAA AAAAA BBBBB AAAAA
2 AAAAA BBBBB CCCCC BBBBB
3 BBBBB BBBBB BBBBB BBBBB
4 CCCCC CCCCC AAAAA BBBBB
5 DDDDD AAAAA AAAAA BBBBB
6 AAAAA BBBBB CCCCC DDDDD
7 EEEEE EEEEE DDDDD BBBBB
;

data old_2;
  input PC $ region $;
  datalines;
AAAAA 1111
BBBBB 1111
CCCCC 2222
DDDDD 2222
EEEEE 3333
;

data fmtdata;
  set old_2 (rename=(PC=Start region=label));
  retain fmtname 'rcodes' type 'C';
run;

proc format cntlin = fmtdata;
run;

data need1 (keep=id region year destination_region count);
  set old_1;
  array PCs(2000:2003) $ PC2000-PC2003;
  do _n_=2001 to 2003;
    region=put(PCs(_n_-1),$rcodes.);
    year=_n_;
    destination_region=put(PCs(_n_),$rcodes.);
    count=1;
    output;
  end;
run;

proc sql;
  create table region as
    select distinct region
      from old_2
  ;
quit;
proc sql;
  create table need2 as
   select a.region,
          b.region as destination_region
     from region a, region b
  ;
quit;

proc sql noprint;
  select id
    into :ids separated by ','
      from old_1
  ;
quit;
    
data need2;
  set need2;
  do id=&ids.;
    do year=2001 to 2003;
      output;
    end;
  end;
run;

proc sort data=need1;
  by id year region destination_region;
run;

proc sort data=need2;
  by id year region destination_region;
run;

data need3;
  update need2 need1;
  by id year  region destination_region;
  if missing(count) then count=0;
run;

proc sql;
  create table want as
    select year, region, destination_region, sum(count) as flow
      from need3
        group by region, year, destination_region
  ;
quit;

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
Regular Contributor
Posts: 152

Re: construct my own data sets from two source data sets

very difficult to understand what you're saying, but you might want proc transpose

--------------
blog: papersandprograms.com
Contributor
Posts: 31

Re: construct my own data sets from two source data sets

Posted in reply to PaulBrownPhD

PC is the postal code, Region has id as 1111, 2222 , 3333. So, postal code AAAAA and BBBBB are in Region 1111. I am trying to count the number of movers between Regions for each year.

Regular Contributor
Posts: 152

Re: construct my own data sets from two source data sets

[ Edited ]

firstly i think you want old1 on id-pc level, not id level ie pc as rows, not as cols, then you want to merge region from old2 onto old1, by pc (first sort old1 by pc). Then you have id and region on the same dataset, and you could use eg a simple proc freq or proc sort with nodupkey and dupout=...., to see how many ids have multiple regions

--------------
blog: papersandprograms.com
Solution
‎06-09-2018 07:53 PM
PROC Star
Posts: 8,163

Re: construct my own data sets from two source data sets

I'm sure there is a more direct approach but, since no one has suggested any yet, here is one way to achieve your desired result:

data old_1;
  input id PC2000 $ PC2001 $ PC2002 $ PC2003 $;
  datalines;
1 AAAAA AAAAA BBBBB AAAAA
2 AAAAA BBBBB CCCCC BBBBB
3 BBBBB BBBBB BBBBB BBBBB
4 CCCCC CCCCC AAAAA BBBBB
5 DDDDD AAAAA AAAAA BBBBB
6 AAAAA BBBBB CCCCC DDDDD
7 EEEEE EEEEE DDDDD BBBBB
;

data old_2;
  input PC $ region $;
  datalines;
AAAAA 1111
BBBBB 1111
CCCCC 2222
DDDDD 2222
EEEEE 3333
;

data fmtdata;
  set old_2 (rename=(PC=Start region=label));
  retain fmtname 'rcodes' type 'C';
run;

proc format cntlin = fmtdata;
run;

data need1 (keep=id region year destination_region count);
  set old_1;
  array PCs(2000:2003) $ PC2000-PC2003;
  do _n_=2001 to 2003;
    region=put(PCs(_n_-1),$rcodes.);
    year=_n_;
    destination_region=put(PCs(_n_),$rcodes.);
    count=1;
    output;
  end;
run;

proc sql;
  create table region as
    select distinct region
      from old_2
  ;
quit;
proc sql;
  create table need2 as
   select a.region,
          b.region as destination_region
     from region a, region b
  ;
quit;

proc sql noprint;
  select id
    into :ids separated by ','
      from old_1
  ;
quit;
    
data need2;
  set need2;
  do id=&ids.;
    do year=2001 to 2003;
      output;
    end;
  end;
run;

proc sort data=need1;
  by id year region destination_region;
run;

proc sort data=need2;
  by id year region destination_region;
run;

data need3;
  update need2 need1;
  by id year  region destination_region;
  if missing(count) then count=0;
run;

proc sql;
  create table want as
    select year, region, destination_region, sum(count) as flow
      from need3
        group by region, year, destination_region
  ;
quit;

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 31

Re: construct my own data sets from two source data sets

Thanks for the code. It works!

Contributor
Posts: 21

Re: construct my own data sets from two source data sets

Here is what i tried. Is that right?

 data have;
input city2000 $ city2001 $ city2002 $ city2003 $;
datalines;
a a b c
a a a a
b a a a
c c c c
c a a a
;

data _null_;
if _n_=1 then do;
if 0 then set have;
length o d $50;
  dcl hash H1 (multidata:'y',ordered:'y') ;
   h1.definekey  ('year','o','d') ;
   h1.definedata ('year',"o","d") ;
   h1.definedone () ;
   call missing(o,d);
end;
set have end=last;
array t(*) pc:;
do _n_= 2 to dim(t);
year=vname(t(_n_));
o=t(_n_-1);
d=t(_n_);
rc=h1.add();
end;
if last then h1.output(dataset:'temp2');
run;

data want;
if _n_=1 then do;
if 0 then set temp3;
  dcl hash H1 (dataset:'temp3',multidata:'y') ;
   h1.definekey  ('year','o','d') ;
   h1.definedata ('flows') ;
   h1.definedone () ;
end;
if 0 then set old_1;
array t(*) pc:;
set sort_2(rename=(region=o));
do _n_= 2 to dim(t);
year=vname(t(_n_));
 do n=1 to nobs;
 set sort_2(keep=region rename=(region=d)) nobs=nobs point=n;
flows=0;
rc= h1.find();
 output;
end;
end;
keep year o d flows;
run;
Frequent Contributor
Posts: 103

Re: construct my own data sets from two source data sets

Can you explain the logic for one obs in want

1111 1111 2001 4

 

like how you derive this?

Contributor
Posts: 31

Re: construct my own data sets from two source data sets

In the year of 2001, from obs 1, 2, 3, and 6, there are movement but all within region 1111. 

PROC Star
Posts: 1,766

Re: construct my own data sets from two source data sets

[ Edited ]
data old_1;
  input id PC2000 $ PC2001 $ PC2002 $ PC2003 $;
  datalines;
1 AAAAA AAAAA BBBBB AAAAA
2 AAAAA BBBBB CCCCC BBBBB
3 BBBBB BBBBB BBBBB BBBBB
4 CCCCC CCCCC AAAAA BBBBB
5 DDDDD AAAAA AAAAA BBBBB
6 AAAAA BBBBB CCCCC DDDDD
7 EEEEE EEEEE DDDDD BBBBB
;

data old_2;
  input PC $ region $;
  datalines;
AAAAA 1111
BBBBB 1111
CCCCC 2222
DDDDD 2222
EEEEE 3333
;

data temp;
if _n_=1 then do;
if 0 then set old_2;
   dcl hash H (dataset:'old_2') ;
   h.definekey  ("pc") ;
   h.definedata ("region") ;
   h.definedone () ;
end;
set old_1;
array t(*) pc:;
do _n_=1 to dim(t);
if h.find(key:t(_n_))=0 then t(_n_)=region;
end;
drop pc region;
run;

data _null_;
if _n_=1 then do;
if 0 then set temp;
length o d $50;
  dcl hash H1 (multidata:'y',ordered:'y') ;
   h1.definekey  ('id','year','o','d') ;
   h1.definedata ('id','year',"o","d") ;
   h1.definedone () ;
   call missing(o,d);
end;
set temp end=last;
array t(*) pc:;
do _n_= 2 to dim(t);
year=vname(t(_n_));
o=t(_n_-1);
d=t(_n_);
rc=h1.add();
end;
if last then h1.output(dataset:'temp2');
run;


proc sort data=old_2 out=sort_2(keep=region) nodupkey;
by region;
run;

proc sql;
  create table temp3 as
    select year, o, d, count(*) as flows
      from temp2
        group by year, o, d;
		  
quit;

data want;
if _n_=1 then do;
if 0 then set temp3;
  dcl hash H1 (dataset:'temp3',multidata:'y') ;
   h1.definekey  ('year','o','d') ;
   h1.definedata ('flows') ;
   h1.definedone () ;
end;
if 0 then set old_1;
array t(*) pc:;
set sort_2(rename=(region=o));
do _n_= 2 to dim(t);
year=vname(t(_n_));
 do n=1 to nobs;
 set sort_2(keep=region rename=(region=d)) nobs=nobs point=n;
flows=0;
rc= h1.find();
 output;
end;
end;
keep year o d flows;
run;
PROC Star
Posts: 1,766

Re: construct my own data sets from two source data sets

[ Edited ]
Posted in reply to novinosrin

@sasecn I can certainly concise the above response by further shortening the steps by combining (temp and temp2),( temp3 and want) into one step but for testing purpose it seemed much easier for my eyes

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 266 views
  • 1 like
  • 6 in conversation