Re: Finding similar coordinates

deleted_user · Posted 06-23-2010 07:36 PM

Hey guys!

After struggling and googling this for well over a day now, I decided to give up and ask for help.

I'm a beginner SAS EG user who is trying to use the software to analyze data for my thesis (and no, my thesis is not about SAS EG 😄 ).

Among other things, the data has lot of coordinates for a lot of people (over 100k rows). So I have an x column for each person and an y column. Now I'm trying to find a way to count the number of clients that share the exact same x and y coordinates. Also a print out of all the client_id's that share the same location would help a lot.

Although there are a lot of clients, there shouldn't be too many that share the same coordinates. Enough though to give me a headache.

Any help would be extremely appreciated!

Cynthia_sas · Posted 06-23-2010 08:29 PM

Hi:
At first, for investigative purposes, this seems to be a simple counting problem that you could do with any number of procedures. I chose PROC TABULATE because it creates an automatic _TYPE_ variable in the output dataset that can be used to determine which of the CLASS variables contributed to the count. (look at the first PROC PRINT output -- it should become clearer what/how _TYPE_ can be used.)

If you look at the program below, there are 26 observations, each with a PERSON, XVAL and YVAL variables. Some folks have XVAL the same; some folks have YVAL the same and some folks have both values the same.

The PROC TABULATE step creates one output dataset with the counts for ALL XVALS, ALL YVALS and the combinations of XVAL and YVAL. The different PROC PRINTS show you ALL the obs in the output dataset versus just the XVALs GT 1, just the YVALs GT 1 and both the same GT 1.

cynthia

[pre]
data coord;
infile datalines;
input person $ xval yval;
return;
datalines;
alan 1 2
barb 1 3
carl 1 4
dave 2 5
edna 2 5
frank 2 6
georgia 2 6
harry 2 7
ida 3 2
jack 3 3
kathy 3 3
lewis 3 4
mary 3 5
norm 3 5
ophelia 4 3
peter 4 4
quentin 4 4
rose 4 5
steve 4 5
tara 4 6
umberto 4 7
victor 4 8
wilma 4 8
xavier 5 2
yarah 5 2
zack 5 3
;
run;

** Make an output dataset of the counts;
proc tabulate data=coord f=comma6. out=c_out(drop=_PAGE_ _TABLE_);
class xval yval;
table xval,n;
table yval,n;
table xval*yval,n;
run;

ods listing;
proc print data=c_out;
title "look at counts TYPE 10 is XVAL only, TYPE 01 is YVAL only";
title2 "and TYPE 11 is XVAL and YVAL";
run;

proc print data=c_out;
title 'Xvals with duplicates';
where _TYPE_ = '10' and N gt 1;
run;

proc print data=c_out;
title 'Yvals with duplicates';
where _TYPE_ = '01' and N gt 1;
run;

proc print data=c_out;
title 'Both Coords with duplicates';
where _TYPE_ = '11' and N gt 1;
run;

*** now that you have these values in a dataset, if you want to;
*** get the individual observations, you can go back and join;
*** these datasets with the original data to extract the folks;
*** who share coordinates.;

[/pre]

deleted_user · Posted 06-24-2010 10:13 AM

Thank you for the quick reply, Cynthia!

Since I have a lot of coordinates, I cut the code down to "Both Coords with duplicates." There seems to be quite a lot of people sharing coordinates.

There's an awful lot I need to learn before I can start efficiently working with SAS and editing code (for example I don't understand how SAS knows to compare the right colums). So as a next step, I'm trying to learn the basics so I can go on to a) reporting the customer numbers that share coordinates b) using further criteria (such as opposite sex, age) to identify married couples.

Cynthia_sas · Posted 06-24-2010 11:48 AM

Hi:
PROC TABULATE is like the "GINSU steak knife" of SAS -- it slices and dices and counts your data however you want. Since XVAL and YVAL are defined as CLASS usage items (the variable values set categories to count), then for TABULATE, every unique value of XVAL is counted in the first TABLE statement : table xval,n; N is the keyword statistic for the count. The same is true for the second TABLE statement with table yval,n;

Then, when TABULATE gets to the 3rd table statement
table xval*yval,n;
The * plays an important role here -- it is a table operator that says to "nest" or "stack" values of the YVAL variable within the unique XVAL variables -- thus giving you the unique combos of both variables taken together.

There are a lot of user group papers and tutorials about PROC TABULATE. If you use this Google search string:
TABULATE tutorial SAS

The first 4 or 5 hits that come up will go into a lot more detail on the use of PROC TABULATE. Eventually, you may need to use a DATA step program to create other variables for counting and there are tutorials out there on that topic as well.

Good luck!

cynthia

Ksharp · Posted 06-24-2010 11:26 PM

Hi.
Not understand your mean totally.
It would be better to give what your output looks like.
After Cynthia@sas ' s dataset.

[pre]
data coord;
infile datalines;
input person $ xval yval;
return;
datalines;
alan 1 2
barb 1 3
carl 1 4
dave 2 5
edna 2 5
frank 2 6
georgia 2 6
harry 2 7
ida 3 2
jack 3 3
kathy 3 3
lewis 3 4
mary 3 5
norm 3 5
ophelia 4 3
peter 4 4
quentin 4 4
rose 4 5
steve 4 5
tara 4 6
umberto 4 7
victor 4 8
wilma 4 8
xavier 5 2
yarah 5 2
zack 5 3
;
run;

proc freq data=coord;
tables xval*yval / out=count list nopercent nocum;
run;
proc sql ;
select a.person,b.xval,b.yval,b.count
from coord as a right join count as b
on a.xval = b.xval and a.yval = b.yval;
quit;
[/pre]

Ksharp

Finding similar coordinates