About Jarvin99

Tom · ‎12-15-2023

You can use the UPDATE statement to implement last observation carried forward. To prevent it from applying to the other variables, just re-read the observation without the ones you do want carried forward.

    data want;
      update have(obs=0) have;
      by gvkey;
      set have(drop=shrcd);
      output;
    run;

If you want to move values backwards in time then you will need a different method.
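A small worked example may make the effect easier to see. This is only a sketch: the sample rows and the variables year and other are made up for illustration, and only gvkey and shrcd come from the post above.

```sas
/* Hypothetical sample data, already sorted by gvkey */
data have;
  input gvkey $ year shrcd other;
  datalines;
A 2001 10 5
A 2002 . .
A 2003 11 .
B 2001 . 7
B 2002 12 .
;

data want;
  update have(obs=0) have;   /* UPDATE never overwrites with a missing value  */
  by gvkey;                  /* carry-forward restarts with each gvkey        */
  set have(drop=shrcd);      /* re-read so the other variables keep their own */
                             /* values, including the missing ones            */
  output;                    /* write every row, not just the last per group  */
run;

proc print data=want;
run;
```

With these rows, shrcd in the second row of gvkey A should pick up the 10 from the row before it, while other stays missing there. The first row of gvkey B keeps a missing shrcd because there is nothing earlier in that group to carry forward.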

mkeintz · ‎12-13-2023

@Jarvin99 wrote:

Hi, Thank you so much for your code. I managed to solve my problem using the following codes:

    data education;
      set org.na_dir_profile_education;
      if findw(upcase(qualification), 'BACHELOR')>=1 then bachelor = 1;
      if findw(upcase(qualification), "BACHELOR'S DEGREE")>=1 then bachelor = 1;
      if findw(upcase(qualification), 'BS')>=1 then bachelor = 1;
      ... many lines deleted ...
      if findw(upcase(qualification), 'BInfTech')>=1 then bachelor = 1;
      if findw(upcase(qualification), 'BJ')>=1 then bachelor = 1;
    run;

May I know if there is a simplified way, like writing a macro, to shorten the above, as there are at least 30-40 qualified words for my bachelor indicator?

In addition to the comments others have made about using ELSE IF constructs to avoid doing superfluous IF tests, you should consider using a _TEMPORARY_ array of the search terms as a code-saving device, as in:

    data education (drop=i);
      set org.na_dir_profile_education;
      array _text_ba {10} $20 _temporary_
        ('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
         'BAS','BASc','BAppSc');
      bachelor=0;
      do i=1 to dim(_text_ba) until (bachelor=1);
        bachelor=(findw(qualification,trim(_text_ba{i}))>=1);
      end;
    run;

The above ignores upper/lower case issues, which are easily addressed. More important is the issue of searching for two-word phrases ("bigrams" in this note; one-word phrases are "unigrams"). FINDW is not meant to find them.

Here's a workaround, which divides your search terms into unigrams and bigrams:

    data education (drop=i _:);
      set org.na_dir_profile_education;
      array _unigrams_ba {26} $20 _temporary_
        ('BACHELOR','BS','BSC','BE','BSE','BENG','BA',
         'BAS','BASc','BAppSc','BArch','BBA','BBM','BBS','BCA',
         'BCL','BCom','BComm','BCompt','BEc','BEcon','BEd','BFA',
         'BInf','BInfTech','BJ');
      array _bigrams_ba {6} $20 _temporary_
        ("BACHELOR'S DEGREE",'B Acc','B Arch','B.Acc','B.Math','B.Proc');

      /* Make everything upper-case, for finding purposes */
      _upcase_qual=upcase(qualification);
      if _n_=1 then do;
        do i=1 to dim(_unigrams_ba);
          _unigrams_ba{i}=upcase(_unigrams_ba{i});
        end;
        do i=1 to dim(_bigrams_ba);
          _bigrams_ba{i}=upcase(_bigrams_ba{i});
        end;
      end;

      bachelor=0;
      do i=1 to dim(_unigrams_ba) until (bachelor=1);
        bachelor=(findw(_upcase_qual,trim(_unigrams_ba{i}))>=1);
      end;
      if bachelor=0 then do i=1 to dim(_bigrams_ba) until (bachelor=1);
        _w1=findw(_upcase_qual,trim(scan(_bigrams_ba{i},1)),' .','e');
        _w2=findw(_upcase_qual,trim(scan(_bigrams_ba{i},2)),' .','e');
        bachelor=(_w2=_w1+1) and (_w1>0);
      end;
    run;

The "trick" here in dealing with bigrams is to use a feature of FINDW (the 'e' as the 4th parameter of FINDW) that returns the word-sequence number rather than the character position of a search-word inside a string-of-words. The benefit is that if you are searching for "Bachelor's degree", you want to know if the word number of "degree" is one greater than the word number of "Bachelor's". Of course, that is not bulletproof, since it doesn't protect against any of the words appearing more than once, masking detection of the proper sequence. Code can be written to avoid this, but this code is a little more self-evident. The third argument of FINDW (' .') tells the function that only those two characters are word delimiters.

Tom · ‎09-24-2023

I don't think many companies change FY in the middle of a month, so for the START dates that are only missing the day of the month just assume the first of the month. And for the START dates that only include a year what is the logic you want to use? Do you assume it is the same as the FY? Do you want to assume that START was the first of the year? The first of July? The first of the month that company uses to start a new FY?

Reeza · ‎10-20-2022

@Jarvin99 wrote: Also, do you mean it is 1-to-1, so there is no optimization in this case?

It's a one-to-all (one-to-many) join. Every record from Table A is joined to every record in Table B. If you have 10 records in TableA and 20 records in TableB, there will be 10*20 comparisons and 200 records generated if you do not filter the results. If there are 3 empty rows in TableB, then there will be 10*3 = 30 empty records for the second variable in the data set. If you have a large data set this can be very computationally intensive.
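The row arithmetic above can be checked with a minimal sketch. The table and variable names here (tableA, tableB, a, b) are hypothetical stand-ins, not the poster's actual data.

```sas
/* Two toy tables: 10 rows and 20 rows */
data tableA;
  do a = 1 to 10;
    output;
  end;
run;

data tableB;
  do b = 1 to 20;
    output;
  end;
run;

proc sql;
  /* No join condition: every row of tableA pairs with every row of tableB */
  create table all_pairs as
    select *
    from tableA, tableB;

  /* Same pairing, but filtered, so only the matching pairs are kept */
  create table matched as
    select *
    from tableA, tableB
    where a = b;
quit;
```

The log should report 200 rows for all_pairs and 10 rows for matched, and PROC SQL normally adds a note that the unfiltered query involves a Cartesian product join that cannot be optimized.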

Kurt_Bremser · ‎08-22-2022

First of all, you talk of SQL, and IF is not a valid keyword in the SQL language. To subset observations in SQL, you need to use WHERE:

    data have;
    input codeid $4.;
    datalines;
    a111
    a112
    a113
    a114
    ;

    proc sql;
    create table want1 as
      select *
      from have
      where codeid = "a111" or "a112" or "a113"
    ;
    quit;

But then look at the log:

    78   proc sql;
    79   create table want1 as
    80   select *
    81   from have
    82   where codeid = "a111" or "a112" or "a113"
    83   ;
    NOTE: Table WORK.WANT1 created, with 4 rows and 1 columns.

No observation was filtered out, although "a114" is not in your condition. The reason becomes clear when WHERE is used in a DATA step:

    data want2;
    set have;
    where codeid = "a111" or "a112" or "a113";
    run;

because now the log tells you this:

    69   data want2;
    70   set have;
    71   where codeid = "a111" or "a112" or "a113";
    72   run;

    NOTE: There were 4 observations read from the data set WORK.HAVE.
          WHERE 1 /* an obviously TRUE WHERE clause */ ;
    NOTE: The data set WORK.WANT2 has 4 observations and 1 variables.

There's an obvious TRUE condition in your WHERE. Why? The condition is equivalent to this:

    where (codeid = "a111") or ("a112") or ("a113");

OR separates several conditions to form a compound condition, so the second and third value become conditions on their own. By definition, any non-missing character value evaluates to TRUE.

While WHERE in a DATA step is handed off to the dataset engine, which accepts SQL syntax, the IF is compiled by the data step compiler as data step code, and there conditions must be SAS Boolean values (numeric; zero or missing is FALSE, everything else is TRUE).

    data want3;
    set have;
    if codeid = "a111" or "a112" or "a113";
    run;

Log:

    69   data want3;
    70   set have;
    71   if codeid = "a111" or "a112" or "a113";
    72   run;

    NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
          71:23   71:33
    NOTE: Invalid numeric data, 'a112' , at line 71 column 23.
    NOTE: Invalid numeric data, 'a113' , at line 71 column 33.
    codeid=a112 _ERROR_=1 _N_=2
    NOTE: Invalid numeric data, 'a112' , at line 71 column 23.
    NOTE: Invalid numeric data, 'a113' , at line 71 column 33.
    codeid=a113 _ERROR_=1 _N_=3
    NOTE: Invalid numeric data, 'a112' , at line 71 column 23.
    NOTE: Invalid numeric data, 'a113' , at line 71 column 33.
    codeid=a114 _ERROR_=1 _N_=4
    NOTE: There were 4 observations read from the data set WORK.HAVE.
    NOTE: The data set WORK.WANT3 has 1 observations and 1 variables.

The data step compiler compiles an implicit conversion of character to numeric, but at runtime invalid numeric values are encountered, so the conversion results in missing (FALSE) values, and the whole condition is TRUE only when its first part is met. This is why you have to use the IN operator and a list of values in parentheses for your code to work as intended.

Now, anytime you need to work with lists, it is a good idea to store them in their own dataset and use that to make your selection, either by JOINing/MERGEing, by using a hash object, a format, or by creating dynamic code through CALL EXECUTE in a DATA step or SELECT INTO in SQL. My personal favorite is the hash object:

    data lookup;
    input codeid $4.;
    datalines;
    a111
    a112
    a113
    ;

    data want4;
    set have;
    if _n_ = 1
    then do;
      declare hash l (dataset:"lookup");
      l.definekey("codeid");
      l.definedone();
    end;
    if l.check() = 0; /* zero means key was found */
    run;

Log:

    77   data want4;
    78   set have;
    79   if _n_ = 1
    80   then do;
    81     declare hash l (dataset:"lookup");
    82     l.definekey("codeid");
    83     l.definedone();
    84   end;
    85   if l.check() = 0; /* zero means key was found */
    86   run;

    NOTE: There were 3 observations read from the data set WORK.LOOKUP.
    NOTE: There were 4 observations read from the data set WORK.HAVE.
    NOTE: The data set WORK.WANT4 has 3 observations and 1 variables.

The lookup table is sorted in memory into a b-tree, and no sorting has to be done before this step; the order of the "have" dataset is kept. This is the fastest method which SAS provides for lookup tasks. You can even combine an arbitrary number of lookups in one step, as long as the lookups fit into available memory.
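The post names the IN operator as the fix but does not spell it out. A minimal sketch, reusing the same four sample values, might look like this (the output names want_in and want_in2 are made up for illustration):

```sas
data have;
  input codeid $4.;
  datalines;
a111
a112
a113
a114
;

/* PROC SQL: IN compares codeid against each listed value */
proc sql;
  create table want_in as
    select *
    from have
    where codeid in ("a111", "a112", "a113");
quit;

/* DATA step: the same list works in a subsetting IF */
data want_in2;
  set have;
  if codeid in ("a111", "a112", "a113");
run;
```

Both outputs should contain three rows, with a114 filtered out and no character-to-numeric conversion notes in the log.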

andreas_lds · ‎07-11-2022

A merge could be used, but it requires renaming the variables of one dataset:

    data want;
      merge have_a
            have_b(rename=(env=b_env water=b_water
                           transport=b_transport production=b_production));
      by id;
      env = env or b_env;
      water = water or b_water;
      transport = transport or b_transport;
      production = production or b_production;
      drop b_:;
    run;

AlexBennasar · ‎07-08-2022

want=substr(have,5);

Tom · ‎06-27-2022

Please show the actual log. Most likely the path you used does not work on the machine where SAS is actually running. Make sure you have copied the file to that machine and you can point SAS at the file.

Other issues you will have:

1) Some of the lines in the file are longer than the default 32,767 bytes. Set a longer LRECL= option on the INFILE statement.
2) You are truncating the ABSTRACT field. In fact some of the values are longer than the maximum 32,767 bytes that SAS can store in one variable.
3) Two of the lines in the file have embedded linefeeds that will cause SAS to treat the line as two lines. You might want to use something like this to fix the file before trying to read it with SAS: https://github.com/sasutils/macros/blob/master/replace_crlf.sas

Kurt_Bremser · ‎06-25-2022

The number of digits which can be precisely stored does not limit the number of digits to the left of the decimal point; these are only limited by the maximum exponent of the 8-byte real storage. So you can easily store numbers used by astrophysicists, just with precision guaranteed only for the 15 most significant digits. The maximum readable field width for numeric informats is 32.
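A short sketch of both halves of that statement, assuming a host that stores numerics as IEEE 754 doubles (as Windows and Linux SAS do); the variable names are made up for illustration:

```sas
data _null_;
  big  = 1e300;             /* hundreds of digits left of the decimal point are storable */
  x    = 9007199254740992;  /* 2**53: the last point where every integer is still exact  */
  y    = x + 1;             /* the true sum needs a 54th bit, so it cannot be stored     */
  same = (x = y);           /* 1 if the addition was lost to rounding                    */
  put big= e12.;
  put x= 20. y= 20. same=;
run;
```

On such a host I would expect same=1, meaning the stored value of y is indistinguishable from x, while big is stored without complaint. Magnitude is limited by the exponent, precision by the mantissa.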

Kurt_Bremser · ‎05-09-2022

You do not want to change the length of a numeric variable, as that concerns the number of bytes used to store the numbers. What you want to change is the display format for the numeric values. To reliably display up to 14 digits plus 2 fractional digits, use a format of 18.2:

    18 = 14 + 1 (dot) + 1 (sign) + 2 (fractional digits)

In the code I gave you, extend the FORMAT statement:

    format
      TRANSACTION_DT yymmdd10.
      TRANSACTION_AMT 18.2
    ;

Note that a number with more than 15 overall decimal digits will have imprecisions in the last digits, because of the limits of 8-byte real storage.
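To illustrate the difference between storage and display, here is a minimal sketch. The dataset name demo and the sample amount are assumptions; only the 18.2 format and the variable name come from the post.

```sas
data demo;
  transaction_amt = 12345678901234.56;  /* 14 integer digits plus 2 decimals */
  format transaction_amt 18.2;          /* changes display only, not storage */
run;

data _null_;
  set demo;
  put transaction_amt=;          /* written with the attached 18.2 format     */
  put transaction_amt= best8.;   /* same stored value, squeezed into 8 places */
run;
```

The stored 8-byte value is identical in both PUT statements; only the format decides how many digits appear in the log.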

Reeza · ‎04-22-2022

Aren't tweets text data? How are deciles being calculated for that? Or should it be based on time of day for the tweets? You have your deciles on the variable tweets. You're saying the variable tweets doesn't have any duplicates? So no retweets in the data?

Tom · ‎04-18-2022

@Jarvin99 wrote:

Hi, Sorry I have a question again. I tried to add one more delimiter "to become".

    data test;
      set have;
      loc1=findw(description,'and');
      loc2=findw(description,'but');
      loc3=findw(description,'to become');
      if loc1 and loc2 then loc=min(loc1, loc2);
      if loc1 and loc3 then loc=min(loc1, loc3);
      if loc2 and loc3 then loc=min(loc2, loc3);
      else if loc1 then loc=loc1;
      else if loc2 then loc=loc2;
      else if loc3 then loc=loc3;
      if loc then description_1=substrn(description,1,loc-1);
    run;

But, it does not work anymore. loc does not return me the smallest number possible when comparing loc1 and loc2. What should I do? Thank you.

Your IF conditions are wrong when there are three instead of two search terms. Perhaps a different algorithm would be easier to extend when there are more than two search terms.

    data test;
      set have;
      loc=.;
      loc_next=findw(description,'and');
      if loc_next then loc=min(loc,loc_next);
      loc_next=findw(description,'but');
      if loc_next then loc=min(loc,loc_next);
      loc_next=findw(description,'to become');
      if loc_next then loc=min(loc,loc_next);
      if loc>1 then description_1=substr(description,1,loc-1);
    run;

Tom · ‎04-16-2022

The usual issue that causes that is an embedded "DOS" end-of-file character. Use the IGNOREDOSEOF option on the INFILE statement.

IGNOREDOSEOF is used in the context of I/O operations on variable record format files. When this option is specified, any occurrence of ^Z is interpreted as character data and not as an end-of-file marker.

Why use PROC IMPORT to GUESS how to read a file that only has NINE variables? Just write your own data step and you will have full control over how the variables are named, defined, labeled and whether or not any formats need to be attached.

Are those last four variables really just plain numbers? Why aren't the two DATE variables using a date type informat to create actual date values? Why are the two ID variables being read as numbers instead of character strings? You do not need to perform arithmetic with ID variables. What do the lines in the file actually have for those fields?

    data announcement ;
      infile 'D:\Dropbox\Dataset\9ee779962de5b464.csv' dsd ignoredoseof truncover firstobs=2;
      length
        CompanyName $50
        DirectorName $30
        CommitteeName $20
        JobName $50
        Description $200
        AnnouncementDate 8
        CompanyID 8
        DirectorID 8
        EffectDate 8
      ;
      input CompanyName -- EffectDate ;
    run;

To see some example values from the file use a simple data step.

    data _null_;
      infile 'D:\Dropbox\Dataset\9ee779962de5b464.csv' obs=5 ;
      input;
      list;
    run;

Ksharp · ‎04-10-2022

    data have;
    input ID rolename $50.;
    cards;
    1 vp/ceo
    1 division vp/ceo
    1 global ceo/cfo/coo
    2 vice ceo/division cfo
    2 vice ceo/cfo
    2 division ceo/coo/cfo
    ;

    data want;
      set have;
      length want $ 200;
      want=rolename;
      if rolename =: 'division' then want=prxchange('s/\//\/division /o',-1,rolename);
      if rolename =: 'global' then want=prxchange('s/\//\/global /o',-1,rolename);
    run;

andreas_lds · ‎04-05-2022

And another solution:

    data work.want;
      set work.have;
      if _n_ = 1 then do;
        declare hash h(dataset: 'work.have(keep= id year indicator where= (indicator=0))');
        h.defineKey('id', 'year');
        h.defineDone();
      end;
      new_indicator = h.check() = 0;
    run;

Date Last Visited	‎12-17-2023 05:28 AM

fill in missing values with the previous values for panel

Re: match the exact same word

match the exact same word

Re: keep multiple rows based the conditions

keep multiple rows based the conditions

Re: fuzzy matching

fuzzy matching

Re: a list of values to filter

Re: fill in missing values with the previous values for panel

Re: How to keep all variables from orginal dataset when using proc fre...

Re: 1 year (252 days) Buy and hold abnormal return for daily data

Re: How to merge two datasets with the same column names into one data...

Re: extract any words after the fourth location

Re: how to import tsv file

Re: Import csv with variables of infinite decimals

Re: Import a TXT file with data split by a symbol

Re: Split into deciles based on dates

Re: keep all the words before a certain word

Re: import a large csv

Re: split and rejoin words

Re: As long as a column contains a value then the other column changes...