GeoffreyBrent Tracker

Re: The correct average please help

GeoffreyBrent — Mon, 07 Apr 2014 23:37:09 GMT

Yeah, renaming as a data set option is usually a more intuitive way to go, I just got lazy in my demo code 🙂

(When I first worked in SAS, nobody mentioned to me that data step commands aren't all executed in order of appearance, which made the learning process... interesting.)

Re: The correct average please help

GeoffreyBrent — Fri, 04 Apr 2014 06:14:46 GMT

I agree that it's best to convert missing values to zeroes where that's what they represent, but that code won't give the result you're expecting.

RENAME statements are executed after the body of the DATA step is run. So your code will create a new variable named "flu_cases" (which will be zero for cases which are missing from flu_reports_by_date, and missing for all others) before attempting to rename _freq_ to flu_cases. This will generate a warning ("Variable flu_cases already exists") and as far as I can tell, you end up with the same values that were in _freq_, complete with missingness.

Probably safer to do the renaming up in the MERGE statement so you can use the new name through the rest of the data step - I just couldn't remember the exact syntax and didn't have time to look it up.
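
For reference, a minimal sketch of what that could look like, reusing the data set names from the flu example elsewhere in this thread. It assumes the RENAME= data set option is applied as flu_reports_by_date is read, so the new name is available for the rest of the step:

```sas
data all_illnesses_by_date;
  /* rename _freq_ as the data set is read, so flu_cases exists inside the step */
  merge illnesses_except_flu
        flu_reports_by_date (rename=(_freq_=flu_cases));
  by day_reported;
  /* days with no entry in flu_reports_by_date had no reported cases */
  if missing(flu_cases) then flu_cases=0;
run;
```

Because the rename happens on input, the missing-to-zero conversion can refer to flu_cases directly and no "already exists" warning should appear.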

Re: The correct average please help

GeoffreyBrent — Thu, 03 Apr 2014 23:25:49 GMT

This is really more of a stats problem than a programming question, but then I'm more of a statistician than a programmer...

Without context, there's no universal answer to this question of how to calculate the average. There are scenarios in which missing values should be excluded from calculation, others where they should be treated as zeroes, and yet others where missingness makes it impossible to get a sound answer. You need to think about what missing data means in your context, and how the average will be used.

One situation where I commonly encounter missing-as-zero is when merging counts created by PROC SUMMARY/MERGE. For instance, I might have a list of flu cases reported, each with a date attached. I can get day-by-day counts:

proc summary data=flu_reports nway;

class day_reported;

output out=flu_reports_by_date;

run;

I can then merge onto a list of other illnesses reported (assumed to cover all days of the year):

data all_illnesses_by_date;

merge illnesses_except_flu flu_reports_by_date;

by day_reported;

rename _freq_=flu_cases;

run;

If there were no flu cases reported on April 1, then flu_reports_by_date will have no entry for April 1. When I merge it to create all_illnesses_by_date it will have a missing value for flu - but that should be treated as zero. Ideally it'd be changed to zero before calculating averages, but if not, then the missing entry needs to be included in the count when calculating an average.
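
To see how much the choice matters, here is a small sketch with made-up numbers (three days, no flu reports on the second). The data set and variable names are only for illustration:

```sas
/* hypothetical three-day example: no flu reports on day 2 */
data demo;
  input day flu_cases;
  flu_as_zero=coalesce(flu_cases,0); /* treat "no entry" as zero cases */
  cards;
1 3
2 .
3 5
;
run;

/* PROC MEANS leaves missing values out of both the sum and the count */
proc means data=demo mean n;
  var flu_cases flu_as_zero;
run;
```

The mean of flu_cases is 4 (8 cases over the 2 nonmissing days), while the mean of flu_as_zero is about 2.67 (8 cases over all 3 days). Which one is "correct" depends on what the missing value means.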

Re: Log window output

GeoffreyBrent — Thu, 28 Nov 2013 22:56:01 GMT

In addition to Tom's advice, it might be worth trying to figure out why the log is so large.

When I get "log full" type problems, it usually turns out to be because something in my program is generating a very large number of notes/warnings. For instance, if I'm doing something that generates a note at every observation of a 2-million-observation file, that will choke up the log very quickly. One way to identify this is to run the program on a smaller data set so that I can look at the log and see what's taking up the space. If I can identify a step that's spamming the log, I can then fix that problem without needing to shut off logging altogether.
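
One way to do the "smaller data set" run without creating a separate file is the OBS= system option. A rough sketch (the limit of 1000 is arbitrary):

```sas
options obs=1000;   /* read at most 1000 observations from each input data set */

/* ...run the step(s) you suspect of flooding the log... */

options obs=max;    /* restore normal processing afterwards */
```

Remember to reset the option, otherwise every later step in the session will also be truncated.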

Re: This problem can be solved by pre-school children in 5-10 minutes

GeoffreyBrent — Thu, 06 Sep 2012 01:20:43 GMT

I'm with PG on this one. The SAS "solution" posted can only be written by somebody who's already guessed that f(x) might be a sum of individual digit scores. But at that point, they've already done 99% of the heavy lifting. Somebody who has that idea in mind doesn't even need SAS; it's easy enough to figure out the digit values and verify the results by eye.

Re: How to get rid of these warnings in concatenation!

GeoffreyBrent — Tue, 04 Sep 2012 00:42:52 GMT

If you have the data set open, you can also check the length of a character variable by mousing over the column header. After a little while a tooltip will pop up with variable name, length, and other properties.

Re: how to remove duplicates in SAS 4.3 EG

GeoffreyBrent — Sun, 02 Sep 2012 23:59:02 GMT

Patrick is correct, I was aware of this rather important detail but completely neglected to mention it in my response 😞 In hindsight, my PROC SORT example should have used "by _all_".
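
For the record, a sketch of the corrected version, using the same HAVE data set as in my earlier post. As I understand it, NODUP only compares each observation with the one before it, so the sort has to bring identical records together:

```sas
proc sort data=have nodup out=want1;
  by _all_;  /* sort on every variable so identical records end up adjacent */
run;
```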

Re: how to delete one macro variable for another macro variable?

GeoffreyBrent — Sun, 02 Sep 2012 23:56:36 GMT

I see you already have solutions, but you might (or might not) find that existing SAS regression options can save you the need to write your own macro.

For instance, PROC REG gives you options for stepwise selection (add variables with best explanatory power/remove variables with worst) or by using something like SELECTION=MAXR STOP=&MAXEXPVARS you can force SAS to choose the best n variables.
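
A minimal sketch of the MAXR approach. The data set HAVE, the response y, the candidate variables a1-a10 and the value of the macro variable are all placeholders, not taken from your code:

```sas
%let maxexpvars=3;  /* hypothetical cap on the number of explanatory variables */

proc reg data=have;  /* HAVE, y and a1-a10 are placeholder names */
  model y = a1-a10 / selection=maxr stop=&maxexpvars;
run;
quit;
```

Here the variable list a1-a10 is interpreted by PROC REG itself, so there's no need to manipulate it in the macro language.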

Re: how to delete one macro variable for another macro variable?

GeoffreyBrent — Fri, 31 Aug 2012 06:07:50 GMT

You can't delete it from x, because it's not in x to begin with.

What you've done here is define &x as the literal string "a1-a10". I assume you're thinking of this as a list of variables a1, a2, ..., a10, but when you define "%let x=a1-a10" SAS doesn't perform that interpretation. SAS will substitute the characters "a1-a10" in the code where "&x" appears and only then will it interpret it.

Demonstration:

%let x=a1-a3;

data test;

a1=5;

a2=6;

a3=7;

diff=&x;

sum=sum(of &x);

run;

When you run this code, the first thing SAS does is to substitute "a1-a3" where "&x" appears:

data test;

a1=5;

a2=6;

a3=7;

diff=a1-a3;

sum=sum(of a1-a3);

run;

It then interprets based on context. In the first case, it interprets "a1-a3" as a1 minus a3 (= 5-7 = -2); in the second case it interprets it as "a1, a2, a3" (= 5+6+7 = 18).

If you can give some more context on what you're looking to do, people might be able to suggest some other solution.

Re: how to remove duplicates in SAS 4.3 EG

GeoffreyBrent — Fri, 31 Aug 2012 02:20:51 GMT

Assuming you want to remove duplicate observations with the same values?

data have;

input x y;

cards;

1 2

1 4

1 8

2 3

2 4

2 5

1 7

1 8

;

run;

/* PROC SORT method */

proc sort data=have nodup out=want1;

by x y;

run;

/* PROC SQL method */

proc sql;

create table want2 as select distinct * from have;

quit;

If you want to delete duplicates that are not 100% identical (e.g. all units with the same ID variable regardless of whether they differ in other values) look at the NODUPKEY option in PROC SORT.
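
A short sketch of NODUPKEY on the same HAVE data set, treating x as if it were the ID variable (an assumption purely for illustration):

```sas
/* keep one observation per value of x (x stands in for the ID variable) */
proc sort data=have nodupkey out=want3;
  by x;
run;
```

With the sample data this should leave two observations, one for x=1 and one for x=2. Check which record is kept before relying on it, since the other records for each key are simply discarded.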

Re: PROC PANEL: What values SAS uses for the cross-sectional dummies in fixed-effects model?

GeoffreyBrent — Thu, 23 Aug 2012 01:49:21 GMT

I haven't experimented with PROC PANEL (and on a quick glance, the support documentation doesn't help me) but you may want to check whether it's using "effect coding". I remember being very confused by some other modelling PROCs until I found out about effect coding, which I'd never heard of before. Google should be able to find you a good explanation of the concept.

Re: Selecting the right record...

GeoffreyBrent — Wed, 22 Aug 2012 07:22:36 GMT

Assumptions:

(1) no two records for the same key have overlapping dates (i.e. if birth_da for record 1 < birth_da for record 2 on the same key, then ret_date for record 1 < birth_da for record 2)

(2) birth_da is nonmissing

(3) missing value for ret_date indicates not yet retired (so can only occur in the last record)

So for each key, you want to return the latest birth_da that is not greater than the date?

I think the easiest way to do this in simple SAS is to reverse-sort and use LAG to find the next date:

%let date='15Jul2001'd;

data have;

input recordid h_postal $ birth_da ret_date;

informat birth_da ret_date ANYDTDTE20.;

format birth_da ret_date DATE9.;

cards;

1 A 01Jan2000 01Jun2001

2 A 01Jan2002 12Feb2003

3 B 01Jan2000 01Jun2001

4 B 01Jul2001 12Feb2003

5 C 01Jan1999 31Dec1999

6 C 01Jan2001 01Feb2001

7 C 01Jan2003 12feb2003

8 D 01Jan2005 01Jan2007

9 D 01Jan2009 .

10 E 01Jan2001 .

;

run;

proc sort data=have out=have_reverse;

by h_postal descending birth_da;

run;

data want;

set have_reverse;

by h_postal;

next_birth_date=lag(birth_da);

if first.h_postal then next_birth_date=.;

/* because we have sorted in reverse date order, first.h_postal indicates that this is the most recent entry for this unit

(highest value of birth_da) */

if (next_birth_date<=ret_date OR missing(ret_date)) AND not missing(next_birth_date) then put "ERROR: dates overlap";

if (birth_da<=&date) AND ((next_birth_date>&date) OR missing(next_birth_date)) then output;

drop next_birth_date;

run;

Edit: before using this you'll need to convert to SAS date variables (which you should be using anyway for this sort of work) and as per Tom's comment, if a date of "190000001" indicates "not yet retired" this should probably be interpreted as missing.

Message was edited by: Geoffrey Brent

Re: DOES SAS 9.3 Supports Excel2010

GeoffreyBrent — Tue, 21 Aug 2012 03:42:16 GMT

I encountered a similar problem when migrating from SAS 9.1 to 9.3. In our case, I think the cause was that our SAS 9.3 server had mistakenly been set to use the 32-bit version of Microsoft Access; once it was reconfigured to the 64-bit version, things worked fine. (Can't give you any more detail on how to do that, our tech support guys handled the detail.)

Re: Do “subsetting if” and “where” clauses do the same thing?

GeoffreyBrent — Wed, 15 Aug 2012 01:04:33 GMT

The two code examples you provide should give identical results, though WHERE may be faster to run.

However there are many cases where WHERE and subsetting IF do not give identical output. Compare:

data test;

length x $5;

input x;

cards;

One

Two

Three

Four

Five

;

run;

data subset_if;

set test;

sequence_number=_n_;

previous_x=lag(x);

if not(x="Three");

run;

data subset_where;

set test;

sequence_number=_n_;

previous_x=lag(x);

where not(x="Three");

run;

In creating SUBSET_IF, we read in and process all lines from TEST before dropping the third observation. Even though the third observation isn't output directly, it still affects what is output: sequence_number goes 1,2,4,5 and previous_x goes " ", "One", "Three", "Four".

In creating SUBSET_WHERE, the observation with x="Three" is filtered out before it ever reaches the data step, so there's no sign in the output that it ever existed: sequence_number goes 1,2,3,4 and previous_x goes " ", "One", "Two", "Four".

Re: sum and ratio, by year and industry

GeoffreyBrent — Fri, 10 Aug 2012 06:52:30 GMT

I'm a bit confused by "squared revenue" - not obvious why this would be of interest. But I'm assuming TOTREV is supposed to be the sum of squared revenue, not the sum of individual revenue, otherwise the ratio doesn't make much sense either. (You would get different answers depending on whether you worked in dollars or cents.)

Under those assumptions:

proc summary data=have nway;

class herfsic fyear;

var revt_squared;

output out=total_squared_revs (drop=_type_ _freq_) sum(revt_squared)=totrev;

run;

proc sql;

create table want as select

a.*,

b.totrev,

a.revt_squared/b.totrev as conratio

from have as a join total_squared_revs as b

on (a.herfsic=b.herfsic AND a.fyear=b.fyear);

quit;

Re: check if a character is an alphabet

GeoffreyBrent — Fri, 10 Aug 2012 01:50:08 GMT

Toby, have you tested that PrxMatch code? When I use that one I'm getting zeroes where they shouldn't be.

Re: Finding non-exact matches in one dataset

GeoffreyBrent — Fri, 10 Aug 2012 01:42:57 GMT

If you want a Rolls-Royce solution, there are some good commercial packages out there. But you can get a long way with a little bit of SQL and some knowledge of SOUNDEX and edit distance functions (COMPLEV, COMPGED).

SOUNDEX converts a character string to an expression that gives a rough idea of what it sounds like: vowels are omitted and similar-sounding consonants are lumped together. COMPLEV tells you how many single-character edits it takes to convert one string into another.

For instance, SOUNDEX("John") = J5. SOUNDEX("Johann") also equals J5, and SOUNDEX("Susan") = S25. Using COMPLEV to compare the SOUNDEX values tells us that "John" and "Johann" are very similar (score 0), but "John" and "Susan" are less similar (score 2).
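
If you want to check these values on your own system, a throwaway step like this writes them to the log (the variable names are just for the demo):

```sas
data _null_;
  john=soundex("John");
  johann=soundex("Johann");
  susan=soundex("Susan");
  /* edit distance between the SOUNDEX codes, not the original names */
  john_vs_johann=complev(john,johann);
  john_vs_susan=complev(john,susan);
  put john= johann= susan= john_vs_johann= john_vs_susan=;
run;
```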

COMPGED is a more sophisticated version of COMPLEV that accounts for the sorts of errors that are most commonly made: e.g. "Simon"->"Simmon" is more likely than "Simon"->"Simkon".

You can use these functions (and others of your choice) to generate a score for possible matches. Each pair of observations ends up with a score, and you use a cutoff to determine which ones should be considered as possible matches. Here's an example:

%let soundweight=20;

%let gedisweight=0.1;

%let cutoff=75;

data have;

id=_n_;

input firstname $ lastname $;

firstname=upcase(firstname); /* COMPGED is case-sensitive */

lastname=upcase(lastname);

firstnamesound=soundex(firstname);

lastnamesound=soundex(lastname);

cards;

Susan Smith

John Smith

Sue Smith

Johann Schmidt

Sue Jones

Sam Snell

Joe Johnson

Jae Johnston

;

run;

proc sql;

create table want as select

a.id as id1, b.id as id2,

a.firstname as firstname1, b.firstname as firstname2,

a.lastname as lastname1, b.lastname as lastname2,

a.firstnamesound as firstnamesound1, b.firstnamesound as firstnamesound2,

/* The next few variables don't need to be here - they are recalculated separately in the join condition below.

I've included them here so you can see what these intermediate functions look like before they're combined to

generate the overall match score. */

complev(a.firstnamesound,b.firstnamesound) as firstnamesoundscore,

complev(a.lastnamesound,b.lastnamesound) as lastnamesoundscore,

compged(a.firstname,b.firstname) as firstnameeditscore,

compged(a.lastname,b.lastname) as lastnameeditscore,

&soundweight*(calculated firstnamesoundscore + calculated lastnamesoundscore)

+&gedisweight*(calculated firstnameeditscore + calculated lastnameeditscore) as matchscore

from have as a inner join have as b on (a.id < b.id /* prevents duplicates and self-matches */ AND

(

&soundweight*(complev(a.firstnamesound,b.firstnamesound)+complev(a.lastnamesound,b.lastnamesound))

+&gedisweight*(compged(a.firstname,b.firstname)+compged(a.lastname,b.lastname))

)

<&cutoff);

quit;

You may want to play around with the weights and the scoring function, especially if you have other data that could be used to enhance the match. Raising the cutoff will increase the likelihood of accepting a match, so you'll get more false positives but fewer false negatives. Lowering it has the reverse effect.

Re: check if a character is an alphabet

GeoffreyBrent — Thu, 09 Aug 2012 00:55:37 GMT

Another option is using regular expressions:

data t2;

set t1;

one_alpha_rx=prxparse("/[a-zA-Z]/");

ind=prxmatch(one_alpha_rx,x);

drop one_alpha_rx;

run;

For the problem you describe, you'd be better off using Art and Linlin's solutions - they're simpler and probably faster. But if you have to check more complex patterns some time, it's worth learning about regexp matching (this webform won't let me copy and paste, but there's a good paper on this in the SUGI29 archives).

As an example of where regexp comes in handy, I had an application where ID variables were expected to be twelve digits followed by an alpha character and then four more digits. To check whether inputs fit this rule, I used:

legal_pattern=prxparse("/\d{12}[a-zA-Z]\d{4}/");

Note that regexp matching doesn't check whether the variable EXACTLY matches the pattern defined, only whether the pattern appears somewhere in the value. But this isn't a problem if the length of the variable exactly matches the length of the pattern (here, 17 characters).
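
If the variable might be longer than the pattern, one option is to anchor the pattern so that nothing else is allowed before or after it. A sketch, with the data set IDS and the variable id_string as hypothetical names:

```sas
data check_ids;
  set ids;  /* IDS and id_string are hypothetical names */
  /* ^ and $ anchor the match to the start and end of the value;
     \s* tolerates the trailing blanks that pad a SAS character variable */
  legal_pattern=prxparse("/^\d{12}[a-zA-Z]\d{4}\s*$/");
  is_legal=(prxmatch(legal_pattern,id_string)>0);  /* 1 if legal, 0 if not */
  drop legal_pattern;
run;
```

This is not what I used in the original application, just a variation that doesn't depend on the variable length.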

Re: proc summary procedure giving difference in sum when summarized by different classes

GeoffreyBrent — Tue, 07 Aug 2012 01:22:41 GMT

Yeah, looks like roundoff errors to me. A length-8 SAS numeric variable is stored with 52-56 bits in the mantissa (depending on your system), which translates to around 15-17 significant figures in base-10. So when you store a number of magnitude ~ 1E21, you can expect rounding errors of magnitude ~ 100,000.
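
A quick way to see this for yourself (a sketch; the variable names are arbitrary):

```sas
data _null_;
  big=1E21;
  bumped=big+1E4;   /* 1E4 is smaller than the gap between adjacent values near 1E21 */
  if bumped=big then put "1E4 was rounded away";
  else put "1E4 survived the addition";
run;
```

On a system with a 52-bit mantissa, the gap between adjacent representable values near 1E21 is 2**17 = 131,072, so I'd expect the first message.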

Some demo code:

data test;

input x;

cards;

1E4

1E21

-1E21

;

run;

proc summary data=test nway;

var x;

output out=test2 sum=;

run;

On my system, this returns zero: when we add 1E4 to 1E21, it gets rounded away to zero. However, if I reorder the data so that the "1E4" appears last in the data step, the two big values cancel out exactly, and we get the correct sum.

Some options:

- accept the rounding error

- find a way to implement calculations in higher precision (see 2008 SGF paper "Ludicrously Large Numbers" for some ideas)

- fine-tune the SAS calculation to minimise rounding error.

If the values you are adding are mostly positive, I'd go with the first strategy suggested by mkeintz: pick up the small values early so they can add up to something that won't be wiped out by roundoff error.
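
My reading of that strategy, as a rough sketch: sort by absolute value and accumulate from smallest to largest. HAVE and x are placeholder names, and I've used a data step sum rather than PROC SUMMARY so the order of addition is explicit:

```sas
data have_abs;
  set have;          /* HAVE and x are placeholder names */
  abs_x=abs(x);
run;

proc sort data=have_abs;
  by abs_x;          /* smallest magnitudes first */
run;

data _null_;
  set have_abs end=last;
  total+x;           /* sum statement: total is retained across observations */
  if last then put total=;
run;
```

This only helps when the running total grows steadily. If large positives and negatives cancel, the small values still get absorbed into a large intermediate total and lost.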

However, looking at the values you've listed above, it seems you have several very large positives and negatives that almost exactly cancel one another out. For instance, your two largest negatives almost exactly cancel the largest positive; the fourth- and fifth-largest negatives cancel the second-largest positive; and the third-largest on both lists almost cancel out.

Is there some reason why this might be the case? And if so, is there some way of matching up values that are likely to cancel out?

Take the following example:

data test;

length group $1;

input group x;

cards;

A 1E4

B 1E21

B -1E21

C 1E10

C -1E10

;

run;

proc summary data=test;

var x;

output out=test2 sum=;

run;

This returns zero, because the original 1E4 is wiped out by rounding error. But add "class group;" to the PROC SUMMARY and you get the right answer: it forces large values to cancel before combining them with small ones.

Re: Number of Decimal Places

GeoffreyBrent — Tue, 31 Jul 2012 06:37:43 GMT

Adding to Arthur's advice: even if you don't want to count trailing zeroes, the methods above should ONLY be used on character inputs.

On my computer, all three of the code examples above give the wrong answer for an input of 0.000001 or 10000000.1234. The reason for this is that their inputs are defined as numeric by default. When you perform a string operation (STRIP, INDEX, SUBSTR, CAT) on a numeric variable you force SAS to do an implicit conversion from num to char, and the format it selects may not be the one you're expecting. Watch out for that "NOTE: Numeric values have been converted to character values" in the log.

In this case, a numeric value of 0.000001 gets converted to "1E-6" not "0.000001" and 10000000.1234 gets converted to "10000000.123", both of which lead to the wrong answer. Two of these examples assume that there will be a decimal point, and so they give the wrong answer for an integer input.

Possibly worth adding a check in the program to return a warning/error message if this is supplied with numeric input, since it would be a VERY easy mistake to make.
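
One possible form for such a check, using the VTYPE function, which returns 'N' for a numeric variable and 'C' for a character one. HAVE and value are hypothetical names:

```sas
data _null_;
  set have (obs=1);  /* HAVE and value are hypothetical names */
  /* one observation is enough: the type is the same for every row */
  if vtype(value)='N' then
    put "WARNING: value is numeric; decimal-place counts may be wrong.";
run;
```

Starting the message with "WARNING:" makes it stand out in the log the same way a SAS-generated warning does.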