Re: I have absolutely no idea how to solve this..!

TMorville · Posted 11-25-2010 10:00 AM

Hi everyone!

Im doing a piece of work for my employer, where i need to find out how different postnumbercodes are distributed over different countys.

My data looks like this:

Postnumbercode Countycode
Y<> A
Y<> A
Y <> A
Y <> A
Y <> A
Y <><> <> B
Y <> B
Y <> B
Y <> D
Y <> A
Y <> A
Z <> C
Z <> C

So. As you might have noticed, there is some overlap. There are two kinds of overlap:

Acceptable overlap: When postnumbercode changes in the same row as countycode. The above example would be where Z and C changes simultaniously.

Unacceptable overlap: When postnumbercode remains the same thrughout different countycodes. THIS is what i want SAS to recognize. But how?

What i would like SAS to do, is to recognize when the unacceptable overlap happens, and output all those rows in a seperate sheet for me.

Soooo. Any creative ideas?

Thanks a bunch! Message was edited by: TMorville

Cynthia_sas · Posted 11-25-2010 10:24 AM

Hi:
If the Z/C rows are acceptable, are the Y/A rows also acceptable?? What about the Y/B rows??

Or, is Z/C acceptable because the POSTNUMBERCODE and COUNTYCODE change at the same time??? Does this mean that Y/A would be acceptable, but Y/B would not be acceptable because COUNTYCODE changed to B, but POSTNUMBERCODE stayed Y??? What about the Y/A rows that occur after the Y/D row??? Are those acceptable?? Is the ordering of POSTNUMBERCODE and COUNTYCODE significant?? Why aren't all the Y/A rows sorted together??? Is there some other variable that is determining an order that you have not shown??

Can you elaborate, specifically on what your output would look like given the small set of rows? Can you show, specifically, what are the acceptable rows and what are the unacceptable rows? Also, can you explain what you mean by "overlap"??? I see that your data has duplicate values for POSTNUMBERCODE and COUNTYCODE and duplicate pairs of values for POSTNUMBERCODE and COUNTYCODE. For example, POSTNUMBERCODE seems to have 2 possible values: Y and Z while COUNTYCODE has 3 possible values: A, B, C and D.

On the other hand, you have several different data conditions for "pairs" of values:
POSTNUMBERCODE = Y and COUNTYCODE= A, B or D
POSTNUMBERCODE = Z and COUNTYCODE = C

Does this mean that POSTNUMBERCODE = Z could NEVER have a COUNTYCODE of A???

Nothing clever, just more questions.

cynthia

TMorville · Posted 11-29-2010 04:13 AM

Hi everyone, monday morning and back to work. Hope you had a nice weekend.

So, in reply to Cynthia:

1) Y/A are NOT acceptable, because they at row Y/B start to overlap. In this case, i would like SAS to identify that there is the Y/A-Y/B overlap, and then return all of the Y rows.

This means that the ENTIRE Y row is not acceptable, and needs manual truncating, why i need to get all the Y rows in a output.

2) Z/C is acceptable because they change AT THE SAME TIME, exactly as you deduced.

3) Neither Y/A (both orcurrences), Y/B or Y/D is acceptable, b/c countycode changes, while postnumbercode stays the same. Thats the stuff i wan' t to pick up!

4) My output given the sample above, would be ALL the Y rows from Postnummercode. Because there is variation in Countycode, whilst postnumbercode stays the same.

5) Lets say that POSTNUMBERCODE = Z is first found to have COUNTYCODE = A, THEN POSTNUMBERCODE = Z - SHOULD - NEVER have COUNTYCODE = B - and if it happens, i need to identify it. I want to allow variation in postnumbercode over countycode, so POSTNUMBERCODE A,B,C FOR COUNTYCODE = C is acceptable. But POSTNUMBERCODE A for COUNTYCODE = C,D,E is NOT acceptable.

And if there is found "unacceptable" variation, then i would like SAS to output all the postnumbercode in which that variation happened. In the case above, all the Y rows.

-

In reply to Peter.C:

Yes. The deal is, that i have some more variables, and i need a specific numeric order of my data. So sorting around and (dis)allowing some rows is not an option for me.

In reply to SPR:

Sadly doing nodupkey will ruin my dataset as i need to IDENTIFY where the "overlap" specified above happens, and truncate some data manually.

In reply to RSB:

It isen't that the SECOND orcurrence of county for the same postcode is unacceptale. The unacceptable part is that there is variation in countycode, while postcode stays the same! (See reply 4&5 for Cynthia)

-

Well, thanks a bunch for the input already! I hope that my second post cleared up some stuff.

-

This is a link to a small sample of data. I've deleted some sensetive stuff, but the essence of my problem is visible. All the red rows, are the stuff that i want SAS to output to me, while the stuff marked green is acceptable.

Link: http://dl.dropbox.com/u/1321324/Work/sample.xls

-

-T

PS: My data is sorted with POSTNUMBERCODE from smallest to largest. Message was edited by: TMorville

Peter_C · Posted 11-29-2010 08:11 AM

So 1:N relation between County:Post is to be tested without sorting data
My simpler solution assumes data are unsorted but grouped by Post[pre]data unacceptable ;
set data end = _enough ;
by post notSorted county notSorted ;
if first.post then _unacc = 0 ;
if _unacc then output ;
retain _unacc _startAt ;
else do ;
if first.post then _startAt = _n_ ;
if first.County and not first.Post then do ;
_unacc = 1 ;
do past= _startAt to _n_ ;
set data point= past ;
output ;
end ;
end ;
end ;
if _enough then stop ;
run ;[/pre]on the test data, this outputs 18 rows
A more complex solution is neccessary if the underlying data might have postNumberCode occurring at different places in file. Then use a hash table to collect postNumberCodes to be reported and pass through the data again.

Ksharp · Posted 11-29-2010 08:38 PM

OK.
That is much more clearer.
If I have understand your logic totally.
As you said, you have sort the postnumbercode from smallest to largest.

[pre]
data temp;
input post $ code $;
cards;
Y A
Y A
Y A
Y A
Y A
Y B
Y B
Y B
Y D
Y A
Y A
Z C
Z C
;run;
proc sql feedback;
select *
from temp
group by post
having count( distinct code ) ge 2
;quit;
[/pre]

Ksharp

SPR · Posted 11-30-2010 09:41 AM

Hello TMorville,

I can NOT ruin your dataset for 2 reasons:
1) I have no acces to it;
2) If you look on my program attentively I create a NEW dataset in proc SORT.

Anyway, now your request looks more clear and this is a solution:
[pre]
data a;
input p c;
datalines;
2625 0187
2625 0187
2625 0187
2625 0187
2625 0187
2625 0187
2625 0187
2625 0187
2630 0169
2630 0169
2630 0169
2630 0169
2630 0169
2630 0169
2635 0183
2635 0183
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0169
2640 0183
2640 0183
2640 0253
2640 0265
2640 0265
2640 0265
2640 0265
2640 0265
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2650 0167
2660 0153
2660 0153
;
run;
proc sort data=a out=u nodupkey;
by p c;
run;
data uu;
set u;
if NOT (FIRST.p=1 and LAST.p=1);
by p;
run;
data r;
merge a (in=a) uu(in=uu);
if a and uu then case="UnAccept";
if a and not uu then case="Accept ";
by p;
run;
[/pre]

SPR

TMorville · Posted 12-02-2010 07:26 AM

Hello again everyone, and thanks for the input!

I've ended up using some of SPR's solution, and now it seems to work. I guess i just need to practice my datasteps.

Well. Thanks again! And have a great december.

Peter_C · Posted 12-02-2010 08:02 AM

T
sorry you didn't try the simpler step I offered at http://support.sas.com/forums/thread.jspa?messageID=46470#46470 as it was designed to work without even sorting your data..

Peter_C · Posted 12-02-2010 08:04 AM

T
sorry you didn't try the simpler step I offered at http://support.sas.com/forums/thread.jspa?messageID=46470#46470 as it was designed to work without even sorting your data..

TMorville · Posted 12-02-2010 09:05 AM

Hi Peter. I did try your code at my data, but it simply diden't return any output.

It worked fine on the sample data. I think that my main data might look alittle different than the data i offered in the sample.

Peter_C · Posted 11-25-2010 11:40 AM

data appears disordered. Is that important?
When extracting the "unacceptable" are all rows for CountyCode required or just those which break the rules?
(is assumption that CountyCode may not hold more than one PostNumberCode
(the rule? )

SPR · Posted 11-25-2010 11:51 AM

Hello TMorville,

If I understand you correctly, this is a possible solution:
[pre]
data a;
input p $ c $;
datalines;
Y A
Y A
Y A
Y A
Y A
Y B
Y B
Y B
Y D
Y A
Y A
Z C
Z C
;
run;
proc sort data=a out=r nodupkey;
by p c;
run;
[/pre]
Sincerely,
SPR

RSB · Posted 11-26-2010 02:12 AM

Assuming that the second occurence of country for the same post code is unacceptable.

data a;
input p $ c $;
datalines;
Y A
Y A
Y A
Y A
Y A
Y B
Y B
Y B
Y D
Y A
Y A
Z C
Z C
;

run;

proc sort data = a out=b nodupkeys;
by p c;
run;

data accept unaccept;
set b;
by p c;

if first.p and first.c then output accept;
else if last.c then output unaccept;
run;

Hope this helps.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away