Red Hat® Linux 6 Unleashed
Chapter 34: Programming in Python
Strings and Regular Expressions

To a great extent, the productivity of a computer language depends on how easily the programmer can modify and parse strings. It is precisely these qualities that have contributed to the recent popularity of both Python and Perl.

Strings

Python has a rich variety of string handling features. Features such as appending with the plus sign, string slicing, and the conversion functions str(), repr(), and eval() are built into the language. If those aren't enough, Python's string module offers most of the C language's string.h functionality.

Appending with the Plus Sign

The plus (+) sign allows concatenation of strings, as shown in the following source code:

    fname = "resolv.conf"
    dname = "/etc"
    fullpath = dname + "/" + fname
    print fullpath

This concatenates the directory, a slash, and the filename, resulting in the following output:

    /etc/resolv.conf

String Slices

String slices are Python's ultra-flexible built-in method of extracting substrings from larger strings. They can slice off the beginning or the end relative to either the beginning or end, or slice out the middle relative to either the beginning or end. When you're using string slices, remember that the first character is element 0.

Perhaps the simplest string slice is grabbing a single character (this could also be considered subscripting):

    a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print a[2]

The preceding prints element 2, which is the third character, C. Next, consider the following code, which prints the first five characters of the string:

    a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print a[0:5]

The preceding prints ABCDE.

Often you'll want to remove a certain number of characters from the end of a string, most commonly a trailing newline. A slice with a zero before the colon and a negative number after it trims that many characters (the absolute value of the negative number) from the end of the string:

    a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print a[0:-1]

The preceding code trims the Z off the end of the string. Often you need just a portion from the middle of the string:

    a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print a[2:5]

The preceding code prints elements 2, 3, and 4, which make up the string CDE. The best way to remember this is that the number of elements in the slice is the difference between the second and first numbers, unless the second number goes past the end of the string. In that case the effect is simply to trim off the characters before the element corresponding to the first number. For instance:

    a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print a[2:1000]

The preceding prints CDEFGHIJKLMNOPQRSTUVWXYZ. It strips off elements 0 and 1, printing element 2 (C) as the first character. To print the last three characters of the string, you can combine a huge second number with a negative first number:

    a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print a[-3:1000]

The preceding code prints the last three characters, XYZ. If you find it unappealing to provide an arbitrarily large number, you can use the len() function to get the exact string length and accomplish the same thing, as illustrated in the following example:

    a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print a[-3:len(a)]

Once again, the output is XYZ. Occasionally you may wish to extract a substring from almost the end. In that case, both numbers are negative. For instance, to get the fourth-from-last, third-from-last, and second-from-last characters of the string:

    a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print a[-4:-1]

The preceding code prints WXY.
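
The slices shown so far can also be written with a bound left out, which Python reads as "from the start" or "to the end". The following is a minimal sketch rather than part of the original listings; it assumes Python 3, where print is a function, and the variable names are chosen only for illustration.

```python
a = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Leaving out a bound means "from the start" or "to the end".
first_five = a[:5]       # same as a[0:5]
all_but_last = a[:-1]    # same as a[0:-1]
from_third = a[2:]       # same as a[2:1000]
last_three = a[-3:]      # same as a[-3:len(a)]
near_end = a[-4:-1]      # both bounds counted from the end

for piece in (first_five, all_but_last, from_third, last_three, near_end):
    print(piece)
```

Omitting the bound avoids both the arbitrary large number and the call to len().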
Remember that the number of characters returned is the difference between the numbers, assuming the original string contains enough characters.

String slicing is a versatile tool that's built into the Python language. There are more uses of string slicing, but the preceding discussion will give you a strong foundation.

Converting with str(), repr(), and eval()

Python is more strongly typed than some languages (Perl, for instance), so strings and numbers can't be mixed without conversion. The str() and repr() functions convert a number to a string, while the eval() function converts a string to a number. The difference between str() and repr() is that repr() attempts to return a string that can be converted back to a number by eval(), while str() attempts to return a printable string. In practice, both repr() and str() usually do an excellent job of returning a string that's convertible back to a number. Consider this example:

    pi_string = "3.14159"
    r = 2.0
    area = eval(pi_string) * r * r
    print "Area = " + repr(area)
    print "Area = " + str(area)
    print "Area =",
    print 3.14159 * 2 * 2

The preceding code prints the area of a radius 2 circle three times. The first two times, it converts a string representation of pi to a number before doing the calculation, and the third time, it calculates from numerical constants. The first and second times differ only in that the first uses repr(), while the second uses str(). All three produce the identical results shown here:

    Area = 12.56636
    Area = 12.56636
    Area = 12.56636

Rearranging a String

String slicing and appending can be used to rearrange strings.
Consider the following function to convert a MM/DD/YYYY date to YYYYMMDD:

    def yyyymmdd(d):
        return(d[-4:1000] + d[0:2] + d[3:5])

    print yyyymmdd("07/04/1999")
    print yyyymmdd("12/25/1999")
    print yyyymmdd("01/01/2000")

Function yyyymmdd() returns the last four characters, then the first two, and then the fourth and fifth (elements 3 and 4) to produce this output:

    19990704
    19991225
    20000101

This simplistic function assumes an exact format for the input date. Later in this chapter you'll build a "smarter" converter using string and number conversions and regular expressions.

The String Module

Occasionally the previously described string capabilities aren't enough. Python's string module is used for those cases. Some functions included in the string module are explained in this section.

atof(), atoi(), and atol() are sophisticated alternatives for the capabilities yielded by eval(). These alternatives are capable of working with number systems other than decimal.

Various case conversions return copies of their arguments rather than changing the string in place. capitalize() capitalizes the first character of a string. capwords() capitalizes the first letter of each word (but has the side effect of removing redundant whitespace and leading and trailing whitespace). upper() completely capitalizes its argument. lower() converts every letter to lowercase. swapcase() converts all lowercase to uppercase, and vice versa.

Functions lstrip(), rstrip(), and strip() return copies of their argument with whitespace stripped from the left, the right, and both sides, respectively.

split(s[, sep[, maxsplit]]) returns a list of substrings from string s. The default for sep is one or more contiguous whitespace characters (in other words, it splits the string into space-delimited words). Used with a sep argument, this function becomes a powerful aid in parsing delimited data files.
maxsplit defaults to 0, but if positive it declares the maximum number of splits, with any remainder becoming the last entry in the list. The following can split a quote- and comma-delimited record into fields. Note that the actual parsing is accomplished in two lines:

    import string
    s = "\"Smith\",\"John\",\"developer\""
    s = s[1:-1]                   #strip first and last quote
    z = string.split(s, '","')    #split by ","
    for x in z:                   #print fields
        print x

join(list[, sep]) is the inverse of split(). It joins the list into a single string, with each item separated by sep if it's used. If sep is not used, it defaults to a single space. Thus, string.join(string.split(s)) removes extra whitespace from string s, string.join(string.split(s),"") removes all whitespace, and string.join(string.split(s),"|") delimits the formerly whitespace-delimited words with pipe characters.

find() and rfind() are used to find substrings from the left and right, respectively. Function count() counts the number of non-overlapping substring occurrences in the string.

zfill(s, width) left-fills a string with zeros.

There are several more functions in the string module. They can be found in the module documentation. To use the string module, remember these two requirements:

- The import string
- Each function
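
The split, join, find, count, and zero-fill operations described above can be tried together in a few lines. This is only a sketch under stated assumptions: it targets Python 3 and therefore uses the equivalent string methods (s.split(), sep.join(), and so on) instead of the string module functions, and the sample data and variable names are made up for illustration.

```python
record = '"Smith","John","developer"'

# Strip the outer quotes, then split on the "," delimiter.
fields = record[1:-1].split('","')

messy = "  too   many    spaces  "
words = messy.split()              # no sep: split on runs of whitespace
single_spaced = " ".join(words)    # extra whitespace removed
no_spaces = "".join(words)         # all whitespace removed
piped = "|".join(words)            # words delimited by pipe characters

first_an = "banana".find("an")     # search from the left
last_an = "banana".rfind("an")     # search from the right
ana_count = "banana".count("ana")  # non-overlapping occurrences
padded = "7".zfill(2)              # left-fill with zeros

print(fields)
print(single_spaced, no_spaces, piped)
print(first_an, last_an, ana_count, padded)
```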

Regular Expressions

Regular expressions enable the programmer to complete parsing tasks in a few lines of code instead of the 20-100 lines required in the C language. Regular expressions are flexible, wildcard-enabled strings that are used to match, pick apart, and translate other strings.

Regular Expressions: Python and Perl

The Perl language's success can be attributed partially to its inclusion of regular expressions. Python also supports regular expressions with the new, Perl-compatible re module, as well as the obsolete regex module (which you shouldn't use). The syntax for invoking regular expressions and retrieving groups differs greatly between the two languages, but the regular expressions themselves are identical. So if you can construct a Perl regular expression, you can do the same in Python. Here is a Perl example for finding the seven characters before the word "Linux" in a string:

    #!/usr/bin/perl
    my($a) = "I like Red Hat Linux for development.";
    $a =~ m/(.{7})Linux/;
    my($b) = $1;          #group 1
    print "<$b>\n";

Here's the same code written in Python:

    #!/usr/bin/python
    import re
    a = "I like Red Hat Linux for development."
    m = re.search("(.{7})Linux", a)
    b = m.group(1)
    print "<" + b + ">"

Each prints the string <ed Hat >. The syntax is completely different, but the regular expression, in this case (.{7})Linux, remains the same.

The preceding code uses the re module's search function. The re module also contains a match function, which will not be covered in this chapter.

Note the line import re. This must appear in every module using the re module's regular expressions.

Simple Matches

The simplest use of a regular expression is to determine whether a string conforms to the specified regular expression. Perhaps the simplest means is searching the string for the existence of a substring:

    a = "I like Red Hat Linux for development."
    m = re.search("Linux", a)
    if m is None:
        print "Not found"
    else:
        print "Found"

The function re.search() returns a match object.
The match object, which was assigned to variable m in the preceding code, contains all information concerning the application of the regular expression against the string. If nothing in the string matches the regular expression, re.search() returns the special value None.

There are several wildcards. Here are the most important ones:

    .       Any character
    ^       Beginning of the string
    $       End of the string
    \s      Whitespace character
    \S      Non-whitespace character
    \d      Digit character
    \D      Non-digit character

There are several repetition specifiers. These can be placed after a character or wildcard to indicate repetition of the character or wildcard. Here are the most important ones:

    *       0 or more repetitions
    +       1 or more repetitions
    ?       0 or 1 repetitions
    {n}     Exactly n repetitions
    {n,}    At least n repetitions
    {n,m}   At least n but not more than m repetitions
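
The wildcards and repetition specifiers listed above can be checked interactively with a small helper. The following is a sketch only, assuming Python 3; matches() is a hypothetical convenience function, not part of the re module, and the patterns are written as raw strings (the r prefix) so that the backslashes reach the re module unchanged.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_digit = matches(r"^\d", "2/14/2000")
ends_with_digit = matches(r"\d$", "Valentines")

# {n} and {n,m} bound the number of repetitions.
has_four_digits = matches(r"\d{4}", "2/14/2000")
short_year_only = matches(r"\d{4}", "12/25/99")
whole_date = matches(r"^\d{1,2}/\d{1,2}/\d{2,4}$", "2/14/2000")
too_many_digits = matches(r"^\d{1,2}/\d{1,2}/\d{2,4}$", "123/1/99")

# ? makes the preceding character optional; + requires at least one.
optional_u = matches(r"colou?r", "color")
two_words = matches(r"^\S+\s\S+$", "two words")

print(starts_with_digit, ends_with_digit, has_four_digits, short_year_only)
print(whole_date, too_many_digits, optional_u, two_words)
```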

Several flags can be used to modify the behavior of the regular expression search. These flags are numerical constants used as an optional third argument to the re.search() function. These flags can be ORed using the pipe symbol to accomplish multiple modifications. By far the most common flag is re.IGNORECASE, which ignores case during searches. There are several others, which can be found in Python's documentation.

Here's a comparison of a search with and without re.IGNORECASE:

    m = re.search("Linux", a, re.IGNORECASE)
    m = re.search("Linux", a)

The first search finds Linux, LINUX, linux, lInUx, and any other upper- and lowercase combination of Linux. The second search finds only the exact string Linux.

To demonstrate wildcards and repetitions, here is an overly simple regular expression to identify whether a date exists in a line:

    a = "Valentines is 2/14/2000. Don\'t forget!"
    m = re.search("\D\d{1,2}/\d{1,2}/\d{2,4}\D", a)
    if m is None:
        print "No date in string."
    else:
        print "String contains date."

The preceding code checks for the existence, anywhere in the string, of a non-digit followed by one or two digits, followed by a slash, followed by one or two digits, followed by another slash, followed by two, three, or four digits, followed by a non-digit.

Note - There is an alternative syntax that precompiles the regular expression for faster use in tight loops. In my experiments, it improved regular expression performance by roughly 15 percent. It is not covered in this chapter. If you need better regular expression performance in loops, look up re.compile() in your Python documentation.

Simple Parsing with Groups

Classifying strings is nice, but the real power comes from the ability to parse strings. Perhaps the simplest example is changing a file extension.
Consider the following:

    src = "myfile.conf"
    m = re.search("(\S+)\.conf", src)
    dst = m.group(1) + ".bak"
    print dst

The preceding code searches the source string for a group of one or more non-whitespace characters, followed by .conf, and creates a match object that is assigned to variable m. The group, which is specified by the parentheses around the \S+, is available as m.group(1), to which .bak is appended to complete the destination name.

Here's an example of parsing dates. Note that this example is not complete enough for use in applications:

    a = "11/21/1999"
    m = re.search("^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})$", a)
    print m.groups()
    month = m.group(1)
    day = m.group(2)
    year = m.group(3)
    print year, month, day

Carefully consider the search statement in the preceding code sample. It looks for a string that consists entirely of a date: a one- or two-digit number whose tens place, if present, is 0 or 1; then either a slash or hyphen; then another one- or two-digit number whose tens place, if present, is 0, 1, 2, or 3; then another slash or hyphen; and finally a number of two to four digits. Note that this is about as much validation as can be done without integer arithmetic.

Each number in the regular expression is surrounded by parentheses so that each will be accessible as a group in the match object. The groups are then evaluated and assigned to month, day, and year. The sample code prints the tuple of groups, followed by the rearranged date:

    ('11', '21', '1999')
    1999 11 21

Another way to accomplish the same objective is to use the groups() function to return a tuple whose elements can then be assigned to the variables in one statement, as shown below:

    a = "11/21/1999"
    m = re.search("^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})", a)
    month,day,year = m.groups()
    print year, month, day

Note that the assignment of m.groups() to three variables works only if the number of variables equals the number of elements in the tuple returned by m.groups().
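
The tuple assignment also fails if the search finds nothing, because re.search() then returns None and there is no groups() to call. A minimal sketch of a guarded version follows, assuming Python 3; split_date() is a hypothetical helper introduced only for this illustration.

```python
import re

def split_date(text):
    """Return (month, day, year) as strings, or None if text is not a date."""
    m = re.search(r"^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})$", text)
    if m is None:
        return None                # no match, so there are no groups
    # Three pairs of parentheses in the pattern give three groups.
    month, day, year = m.groups()
    return (month, day, year)

print(split_date("11/21/1999"))
print(split_date("11-21-99"))
print(split_date("not a date"))
```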
You know in advance how many elements there will be by the number of pairs of parentheses inserted in the regular expression. You can also access the number of elements in the tuple with the len() function.

Regular Expression Example

The following complete program takes a file called test.tst, searches it for lines containing text inside square brackets, and returns that text minus any left or right space:

    #!/usr/bin/python
    import re
    infile = open("test.tst", "r")
    x = infile.readline()
    while x != "":
        x = x[:-1]                              #strip the newline
        #non-greedy (.*?) leaves trailing spaces to the second \s*
        m = re.search("\[\s*(.*?)\s*\]", x)
        if m:
            print m.group(1)
        x = infile.readline()
    infile.close()

The preceding code reads every line of file test.tst, checks it for text between brackets, and prints it (if such text exists). Since many types of configuration files use brackets for headers, this can be molded into useful code.

Strings and Regular Expressions Example

Listing 34.1 illustrates many features of strings and regular expressions. It repeatedly queries the user to type in dates, evaluating, checking, and printing those dates until the user types in a single lowercase q.

LISTING 34.1 ex34_1.py Prints Dates in Different Formats

    #!/usr/bin/python
    ###########################################
    # Sample Only. Do not use in production.
    ###########################################
    import re          #regular expressions
    import string      #string manipulation
    import readline    #command line editing

    def std(s):
        m = re.search("^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})", s)
        if m:
            #atoi() rather than eval(): eval("08") fails as a bad octal literal
            mm = string.atoi(m.group(1))
            dd = string.atoi(m.group(2))
            yyyy = string.atoi(m.group(3))
            if yyyy < 40:
                yyyy = yyyy + 2000
            elif yyyy < 100:
                yyyy = yyyy + 1900
            rv = (yyyy, mm, dd)
        else:
            rv = None
        return(rv)

    def mdy(t):
        if t:
            mstring = string.zfill(t[1], 2)
            dstring = string.zfill(t[2], 2)
            return(mstring + "/" + dstring + "/" + str(t[0]))
        else:
            return("bad date")

    def ymd(t):
        if t:
            mstring = string.zfill(t[1], 2)
            dstring = string.zfill(t[2], 2)
            ystring = string.zfill(t[0], 4)
            return(ystring + mstring + dstring)
        else:
            return("bad date")

    def printdates(s):
        print std(s)
        print mdy(std(s))
        print ymd(std(s))

    def main():
        x = raw_input("Please type a date, q to quit==>")
        while(x != "q"):
            printdates(x)
            x = raw_input("Please type a date, q to quit==>")

    main()

Function std() creates a standard ymd tuple from its string argument, returning None if the string is not a date. Function mdy() formats a standard ymd tuple as a mm/dd/yyyy string, while function ymd() formats a standard ymd tuple as a yyyymmdd string. Both mdy() and ymd() return the string bad date if passed None.
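
For comparison, here is a compact sketch of the same three functions, not a replacement for the listing. It assumes Python 3, precompiles the pattern with re.compile(), converts the groups with int() (which accepts leading zeros such as "08"), and formats with the % operator. The name DATE_RE and the trailing $ anchor are choices made for this sketch and do not appear in Listing 34.1.

```python
import re

# Precompiled once, then reused for every call to std().
DATE_RE = re.compile(r"^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})$")

def std(s):
    """Return a (year, month, day) tuple of integers, or None if s is not a date."""
    m = DATE_RE.search(s)
    if m is None:
        return None
    mm, dd, yyyy = [int(g) for g in m.groups()]
    if yyyy < 40:            # two-digit years below 40 are taken as 20xx
        yyyy = yyyy + 2000
    elif yyyy < 100:         # remaining two-digit years are taken as 19xx
        yyyy = yyyy + 1900
    return (yyyy, mm, dd)

def mdy(t):
    """Format a (year, month, day) tuple as mm/dd/yyyy."""
    if t is None:
        return "bad date"
    return "%02d/%02d/%d" % (t[1], t[2], t[0])

def ymd(t):
    """Format a (year, month, day) tuple as yyyymmdd."""
    if t is None:
        return "bad date"
    return "%04d%02d%02d" % t

for text in ("7/4/1999", "08/09/99", "1-1-00", "q"):
    print(std(text), mdy(std(text)), ymd(std(text)))
```

Like the listing, this only checks the shape of the input; it does not reject impossible dates such as month 19 or day 39.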
© Copyright Macmillan USA. All rights reserved.