BookmarkSubscribeRSS Feed
TMorville
Calcite | Level 5
Hi everyone!

Im doing a piece of work for my employer, where i need to find out how different postnumbercodes are distributed over different countys.

My data looks like this:

Postnumbercode Countycode
Y<> A
Y<> A
Y <> A
Y <> A
Y <> A
Y <><> <> B
Y <> B
Y <> B
Y <> D
Y <> A
Y <> A
Z <> C
Z <> C

So. As you might have noticed, there is some overlap. There are two kinds of overlap:

Acceptable overlap: When postnumbercode changes in the same row as countycode. The above example would be where Z and C changes simultaniously.

Unacceptable overlap: When postnumbercode remains the same thrughout different countycodes. THIS is what i want SAS to recognize. But how?

What i would like SAS to do, is to recognize when the unacceptable overlap happens, and output all those rows in a seperate sheet for me.

Soooo. Any creative ideas?

Thanks a bunch! Message was edited by: TMorville
12 REPLIES 12
Cynthia_sas
SAS Super FREQ
Hi:
If the Z/C rows are acceptable, are the Y/A rows also acceptable?? What about the Y/B rows??

Or, is Z/C acceptable because the POSTNUMBERCODE and COUNTYCODE change at the same time??? Does this mean that Y/A would be acceptable, but Y/B would not be acceptable because COUNTYCODE changed to B, but POSTNUMBERCODE stayed Y??? What about the Y/A rows that occur after the Y/D row??? Are those acceptable?? Is the ordering of POSTNUMBERCODE and COUNTYCODE significant?? Why aren't all the Y/A rows sorted together??? Is there some other variable that is determining an order that you have not shown??

Can you elaborate, specifically on what your output would look like given the small set of rows? Can you show, specifically, what are the acceptable rows and what are the unacceptable rows? Also, can you explain what you mean by "overlap"??? I see that your data has duplicate values for POSTNUMBERCODE and COUNTYCODE and duplicate pairs of values for POSTNUMBERCODE and COUNTYCODE. For example, POSTNUMBERCODE seems to have 2 possible values: Y and Z while COUNTYCODE has 3 possible values: A, B, C and D.

On the other hand, you have several different data conditions for "pairs" of values:
POSTNUMBERCODE = Y and COUNTYCODE= A, B or D
POSTNUMBERCODE = Z and COUNTYCODE = C

Does this mean that POSTNUMBERCODE = Z could NEVER have a COUNTYCODE of A???

Nothing clever, just more questions.

cynthia
TMorville
Calcite | Level 5
Hi everyone, monday morning and back to work. Hope you had a nice weekend.

So, in reply to Cynthia:

1) Y/A are NOT acceptable, because they at row Y/B start to overlap. In this case, i would like SAS to identify that there is the Y/A-Y/B overlap, and then return all of the Y rows.

This means that the ENTIRE Y row is not acceptable, and needs manual truncating, why i need to get all the Y rows in a output.

2) Z/C is acceptable because they change AT THE SAME TIME, exactly as you deduced.

3) Neither Y/A (both orcurrences), Y/B or Y/D is acceptable, b/c countycode changes, while postnumbercode stays the same. Thats the stuff i wan' t to pick up!

4) My output given the sample above, would be ALL the Y rows from Postnummercode. Because there is variation in Countycode, whilst postnumbercode stays the same.

5) Lets say that POSTNUMBERCODE = Z is first found to have COUNTYCODE = A, THEN POSTNUMBERCODE = Z - SHOULD - NEVER have COUNTYCODE = B - and if it happens, i need to identify it. I want to allow variation in postnumbercode over countycode, so POSTNUMBERCODE A,B,C FOR COUNTYCODE = C is acceptable. But POSTNUMBERCODE A for COUNTYCODE = C,D,E is NOT acceptable.

And if there is found "unacceptable" variation, then i would like SAS to output all the postnumbercode in which that variation happened. In the case above, all the Y rows.

-

In reply to Peter.C:

Yes. The deal is, that i have some more variables, and i need a specific numeric order of my data. So sorting around and (dis)allowing some rows is not an option for me.

In reply to SPR:

Sadly doing nodupkey will ruin my dataset as i need to IDENTIFY where the "overlap" specified above happens, and truncate some data manually.

In reply to RSB:

It isen't that the SECOND orcurrence of county for the same postcode is unacceptale. The unacceptable part is that there is variation in countycode, while postcode stays the same! (See reply 4&5 for Cynthia)

-

Well, thanks a bunch for the input already! I hope that my second post cleared up some stuff.

-

This is a link to a small sample of data. I've deleted some sensetive stuff, but the essence of my problem is visible. All the red rows, are the stuff that i want SAS to output to me, while the stuff marked green is acceptable.

Link: http://dl.dropbox.com/u/1321324/Work/sample.xls

-

-T

PS: My data is sorted with POSTNUMBERCODE from smallest to largest. Message was edited by: TMorville
Peter_C
Rhodochrosite | Level 12
So 1:N relation between County:Post is to be tested without sorting data
My simpler solution assumes data are unsorted but grouped by Post[pre]data unacceptable ;
set data end = _enough ;
by post notSorted county notSorted ;
if first.post then _unacc = 0 ;
if _unacc then output ;
retain _unacc _startAt ;
else do ;
if first.post then _startAt = _n_ ;
if first.County and not first.Post then do ;
_unacc = 1 ;
do past= _startAt to _n_ ;
set data point= past ;
output ;
end ;
end ;
end ;
if _enough then stop ;
run ;[/pre]on the test data, this outputs 18 rows
A more complex solution is neccessary if the underlying data might have postNumberCode occurring at different places in file. Then use a hash table to collect postNumberCodes to be reported and pass through the data again.
Ksharp
Super User
OK.
That is much more clearer.
If I have understand your logic totally.
As you said, you have sort the postnumbercode from smallest to largest.


[pre]
data temp;
input post $ code $;
cards;
Y A
Y A
Y A
Y A
Y A
Y B
Y B
Y B
Y D
Y A
Y A
Z C
Z C
;run;
proc sql feedback;
select *
from temp
group by post
having count( distinct code ) ge 2
;quit;
[/pre]


Ksharp
SPR
Quartz | Level 8 SPR
Quartz | Level 8
Hello TMorville,

I can NOT ruin your dataset for 2 reasons:
1) I have no acces to it;
2) If you look on my program attentively I create a NEW dataset in proc SORT.

Anyway, now your request looks more clear and this is a solution:
[pre]
data a;
input p c;
datalines;
2625 0187
2625 0187
2625 0187
2625 0187
2625 0187
2625 0187
2625 0187
2625 0187
2630 0169
2630 0169
2630 0169
2630 0169
2630 0169
2630 0169
2635 0183
2635 0183
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0183
2640 0183
2640 0253
2640 0265
2640 0265
2640 0265
2640 0265
2640 0265
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2660 0153
2660 0153
;
run;
proc sort data=a out=u nodupkey;
by p c;
run;
data uu;
set u;
if NOT (FIRST.p=1 and LAST.p=1);
by p;
run;
data r;
merge a (in=a) uu(in=uu);
if a and uu then case="UnAccept";
if a and not uu then case="Accept ";
by p;
run;
[/pre]

SPR
TMorville
Calcite | Level 5
Hello again everyone, and thanks for the input!

I've ended up using some of SPR's solution, and now it seems to work. I guess i just need to practice my datasteps.

Well. Thanks again! And have a great december.
Peter_C
Rhodochrosite | Level 12
T
sorry you didn't try the simpler step I offered at http://support.sas.com/forums/thread.jspa?messageID=46470#46470 as it was designed to work without even sorting your data..
Peter_C
Rhodochrosite | Level 12
T
sorry you didn't try the simpler step I offered at http://support.sas.com/forums/thread.jspa?messageID=46470#46470 as it was designed to work without even sorting your data..
TMorville
Calcite | Level 5
Hi Peter. I did try your code at my data, but it simply diden't return any output.

It worked fine on the sample data. I think that my main data might look alittle different than the data i offered in the sample.
Peter_C
Rhodochrosite | Level 12
data appears disordered. Is that important?
When extracting the "unacceptable" are all rows for CountyCode required or just those which break the rules?
(is assumption that CountyCode may not hold more than one PostNumberCode
(the rule? )
SPR
Quartz | Level 8 SPR
Quartz | Level 8
Hello TMorville,

If I understand you correctly, this is a possible solution:
[pre]
data a;
input p $ c $;
datalines;
Y A
Y A
Y A
Y A
Y A
Y B
Y B
Y B
Y D
Y A
Y A
Z C
Z C
;
run;
proc sort data=a out=r nodupkey;
by p c;
run;
[/pre]
Sincerely,
SPR
RSB
Calcite | Level 5 RSB
Calcite | Level 5
Assuming that the second occurence of country for the same post code is unacceptable.

data a;
input p $ c $;
datalines;
Y A
Y A
Y A
Y A
Y A
Y B
Y B
Y B
Y D
Y A
Y A
Z C
Z C
;

run;

proc sort data = a out=b nodupkeys;
by p c;
run;

data accept unaccept;
set b;
by p c;

if first.p and first.c then output accept;
else if last.c then output unaccept;
run;

Hope this helps.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 903 views
  • 0 likes
  • 6 in conversation