DATA Step, Macro, Functions and more

Data transform

Accepted Solution Solved
Reply
Contributor
Posts: 36
Accepted Solution

Data transform

Hi all,

I'm performing one study about social network analysis. Before analysis, I need to transform my data in the appropriate format. My raw data are formatted like this:

TomJack
JackMike
TomHellen
DoraHellen
RossJack
WillPaul
PaulHoward
PaulAndy
AndyKim
AndyTom

One row means the two guys in the two columns are directly conneceted  (No matter left-right connected or right-left connected). I need to transform my data like this( if the two guys are directly connectted, the content in the matrix are 1. Finally,one symmetric matrxi will be created):

namesAndyDoraHellenHowardJackKimMikePaulRossTomWill
Andy10000101010
Dora01000000000
Hellen00100000010
Howard00010001000
Jack00001010110
Kim10000100000
Mike00001010000
Paul10010001000
Ross00001000100
Tom10101000010
Will00000000001

How to achieve this? Thanks in advance.

Tom


Accepted Solutions
Solution
‎08-07-2013 11:44 PM
Respected Advisor
Posts: 3,124

Re: Data transform

Obviously this is matrix problem, so Proc IML would probably be my first choice as well. However, Data step can also address this problem using Hash(). This may look a little bit complex, but if you want to build a social network deeper than just one-layer direct connection, then Hash() will  get  very handy quickly.

data have;

input var1 $ var2 $;

cards;

Tom   Jack

Jack  Mike

Tom   Hellen

Dora  Hellen

Ross  Jack

Will  Paul

Paul  Howard

Paul  Andy

Andy  Kim

Andy  Tom

;

data h1;

  set have;

    var=var1;output;

      var=var2;output;

run;

proc sql;

  select distinct var into :a_var separated by ' ' from h1;

    select distinct var into :c_var separated by ', ' from h1;

  select distinct quote(cats(var)) into :h_var separated by ',' from h1;

  quit;

data _null_;

  if _n_=1 then do;

    if 0 then set have;

      declare hash v1(dataset:'have', multidata:'y');

      v1.definekey('var1');

      v1.definedata(all:'y');

      v1.definedone();

            declare hash v2(dataset:'have', multidata:'y');

      v2.definekey('var2');

      v2.definedata(all:'y');

      v2.definedone();

   declare hash want(ordered:'a');

   want.definekey('var');

   want.definedata('var',&h_var.);

   want.definedone();

   end;

   set h1 end=last;

    array _m  &a_var ;

      retain &a_var 0;

      if _n_>1 then do over _m;

_m=0;

       end;

      rc=want.find();

      rc=v1.find(key:var);

         do rc=0 by 0 while (rc=0);

do over _m;

_m=ifn(var=vname(_m) or var2=vname(_m),1,_m);

end;

rc=v1.find_next(key:var);

            end;

      

      rc=v2.find(key:var);

         do rc=0 by 0 while (rc=0);

do over _m;

_m=ifn(var=vname(_m) or var1=vname(_m),1,_m);

end;

rc=v2.find_next(key:var);

            end;

            rc=want.replace();

       if last then rc=want.output(dataset:'want');

      run;


Haikuo

View solution in original post


All Replies
Super User
Super User
Posts: 6,500

Re: Data transform

Here is a post on this that uses IML.

http://nutrinjen.wordpress.com/2009/09/15/adjacency-matrix-using-sasiml/

The idea is really the same in a SAS datastep.

Contributor
Posts: 36

Re: Data transform

Thanks. But it seems I have different data format as the wirter.....

Tom

Super Contributor
Posts: 307

Re: Data transform

Your desired output does not seem to match the sample input data that you have provided.

For example, Andy is connected to Kim, Tom (and Paul, if you go right-to-left) according to your input data. But, your desired output has a match for Hellen even though there is no row in the input data for Andy--> Helen (or vice versa).

But Hellen is "connected" to Tom (who is connected to Andy). So does this imply that you want a match if there is a link through distant connections? So, Andy is "connected" to Tom who is "connected" to Hellen; therefore, Andy and Hellen have a 1 in the matrix? Are only right--> left connections allowed? Or, two-way? If indirect connections are permitted, how many levels are allowed?

The method that Tom linked re: adjacency matrix will work with some modifications (assuming name1 in your input is the "ego" and name2 is the "alter"), especially if you associate a unique ID with every individual.

I've attached some relatively crude code that will mostly accomplish what you want (ASSUMING only left--> right matches, only matching rows in the input file are considered as a match (not downstream matches), and you always want a 1 when the person is the same as the column name).

Attachment
Contributor
Posts: 36

Re: Data transform

Thanks. Sorry I made a mistake. Only direct connections are permitted. So there still is one problem about the codes you attached. Finally, one symmetric matrix will be created. So, on "andy" column, "kim" and "tom" rows will also be 1.




Tom

Super Contributor
Posts: 307

Re: Data transform

The code I sent you will transform the data you supplied into a matrix. You probably missed that I set a dummy variable = 1 in the initial data step (reading in the initial values).

Contributor
Posts: 36

Re: Data transform

Done. Thanks.

proc iml;

use havesorted;

read all var _NUM_ into m;

n_row=nrow(m);

n_col=ncol(m);

do i=1 to n_col;

   do j=1 to n_row;

      if m[j,i]=1 then m[i,j]=1;

   end;

end;

print m;

create MyData from m;

append from m;

close MyData;

quit;

Super Contributor
Posts: 297

Re: Data transform

I don't understand how you are defining your matrix.  You are specifying the creation of a numeric matrix, but you don't have any numeric variables.

Solution
‎08-07-2013 11:44 PM
Respected Advisor
Posts: 3,124

Re: Data transform

Obviously this is matrix problem, so Proc IML would probably be my first choice as well. However, Data step can also address this problem using Hash(). This may look a little bit complex, but if you want to build a social network deeper than just one-layer direct connection, then Hash() will  get  very handy quickly.

data have;

input var1 $ var2 $;

cards;

Tom   Jack

Jack  Mike

Tom   Hellen

Dora  Hellen

Ross  Jack

Will  Paul

Paul  Howard

Paul  Andy

Andy  Kim

Andy  Tom

;

data h1;

  set have;

    var=var1;output;

      var=var2;output;

run;

proc sql;

  select distinct var into :a_var separated by ' ' from h1;

    select distinct var into :c_var separated by ', ' from h1;

  select distinct quote(cats(var)) into :h_var separated by ',' from h1;

  quit;

data _null_;

  if _n_=1 then do;

    if 0 then set have;

      declare hash v1(dataset:'have', multidata:'y');

      v1.definekey('var1');

      v1.definedata(all:'y');

      v1.definedone();

            declare hash v2(dataset:'have', multidata:'y');

      v2.definekey('var2');

      v2.definedata(all:'y');

      v2.definedone();

   declare hash want(ordered:'a');

   want.definekey('var');

   want.definedata('var',&h_var.);

   want.definedone();

   end;

   set h1 end=last;

    array _m  &a_var ;

      retain &a_var 0;

      if _n_>1 then do over _m;

_m=0;

       end;

      rc=want.find();

      rc=v1.find(key:var);

         do rc=0 by 0 while (rc=0);

do over _m;

_m=ifn(var=vname(_m) or var2=vname(_m),1,_m);

end;

rc=v1.find_next(key:var);

            end;

      

      rc=v2.find(key:var);

         do rc=0 by 0 while (rc=0);

do over _m;

_m=ifn(var=vname(_m) or var1=vname(_m),1,_m);

end;

rc=v2.find_next(key:var);

            end;

            rc=want.replace();

       if last then rc=want.output(dataset:'want');

      run;


Haikuo

Contributor
Posts: 36

Re: Data transform

Thanks, Hai.kuo. If I have 100 thousands of IDs, the length of a_var and c_var will exceed the maximal length of one sas string. How to deal with this situation?

Tom

Super Contributor
Posts: 339

Re: Data transform

@Hai.Kuo,

If I recall, graph theory dictates you can rebuild chain links by simply putting your matrix ^n where n would be either the chain length you consider a chain link to still be a link or is the dimension of your matrix. You will then simply need to convert all values >1 to 1s to return to a binary matrix (it can be optimized, computationally, by using binary "or" but I am not sure that SAS supports it quite well).

@TomiKong

With regards to what Hai.Kuo wrote, a "simple" way around exceeding the 65k default length of macro variables, would be to load your var names into multiple macro variables and nest the steps using the macro variables into another macro and loop on those. This could be done as follow:

break down

proc sql;

  select distinct var into :a_var separated by ' ' from h1;

    select distinct var into :c_var separated by ', ' from h1;

  select distinct quote(cats(var)) into :h_var separated by ',' from h1;

  quit;

into

proc sql;

   

  select distinct var into :a_var1-:a_var99999999 from h1;

    select distinct var into :c_var1-:c_var99999999 from h1;

quit;

%let loopend1=&sqlobs;

proc sql;

select distinct quote(cats(var)) into :h_var1-:h_var99999999 from h1;

quit;

%let loopend2=&sqlobs;

Then nesting the rest of the code into %macro something(); %mend; you can use the fact that the following block is equivalent to &a_var in Hai.Kuo's Code.

&a_var1

%do i=2 %to &loopend1;

, &&a_var&i

%end;

On a side note, thanks for this alternative option Hai.Kuo. Everytime I needed to double quote contents using proc sql macro variable allocation, I used to use separated by "," but its less natural to have to add the starting and ending double quotes each time I deref the macro variable. I never even remotely considered the quote function as a way around.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 419 views
  • 6 likes
  • 6 in conversation