Monday, February 27, 2006

st: RE: help cleaning string variable

I don't know if it's still necessary in Stata 9 - it might have been put into the official egen package if it was used enough - but for Stata 8 the fabulous ado package -egenmore-, by Dr. Cox, has a tailor-made egen function called sieve():

Excerpt from the helpfile:

sieve(strvar) , { keep(classes) | char(chars) | omit(chars) } selects characters from strvar according to a specified criterion and generates a new string variable containing only those characters. This may be done in three ways. First, characters are classified using the keywords alphabetic (any of a-z or A-Z), numeric (any of 0-9), space or other. keep() specifies one or more of those classes: keywords may be abbreviated by as little as one letter. Thus keep(a n) selects alphabetic and numeric characters and omits spaces and other characters. Note that keywords must be separated by spaces. Alternatively, char() specifies each character to be selected or omit() specifies each character to be omitted. Thus char(0123456789.) selects numeric characters and the stop (presumably as decimal point); omit(" ") strips spaces and omit(`"""') strips double quotes. (Stata 7 required.)
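For the firm-name case, a minimal sketch might look like the following. It assumes -egenmore- has been installed from SSC, and the variable names firmname, firmclean and firmclean2 are hypothetical. One caveat: sieve() filters characters wherever they occur in the string, not only at the start, so a name that legitimately contains something like & or a full stop would lose it under the first approach.

```stata
* Assumption: egenmore is not yet installed; it is available from SSC
ssc install egenmore

* firmname is a hypothetical name for the string variable to be cleaned.
* Keep only letters, digits and spaces; everything else is dropped,
* wherever in the string it occurs (not only at the beginning).
egen firmclean = sieve(firmname), keep(alphabetic numeric space)

* Removing leading characters can leave stray blanks behind
replace firmclean = trim(firmclean)

* Alternative: drop only specific unwanted characters, here % and ^
egen firmclean2 = sieve(firmname), omit("%^")
```

The omit() form is the more conservative choice when the set of unwanted characters is known in advance, since it leaves all other punctuation in the names alone.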

Hope that helps. Jen

Dear statalist users,

I need to clean a string variable containing the names of a large number of firms (over 30,000). In many cases these names contain extra characters that I would like to eliminate, such as % or " or ^. These characters always come at the beginning of the name. I know that Stata has a command (trim) that eliminates leading and trailing blank spaces from string variables. Is there a similar command to eliminate leading "undesired" characters?

Thank you so much for your help.

Best, Mario

