
Red Hat® Linux 6 Unleashed

Chapter 34: Programming in Python


Strings and Regular Expressions


To a great extent, the productivity of a computer language
depends on how easily the programmer can modify and parse strings. It is
precisely these qualities that have contributed to the recent popularity of both
Python and Perl.

Strings
Python has a rich variety of string-handling features. Appending with the plus
sign, string slicing, and the conversion functions str(), repr(), and eval()
are built into the language. If those aren't enough, Python's string module
offers most of the functionality of the C language's string.h.

Appending with the Plus Sign
The plus (+) sign allows concatenation of strings, as shown in the following
source code:

fname = "resolv.conf"
dname = "/etc"
fullpath = dname + "/" + fname
print fullpath

This concatenates the directory, a slash, and the filename, resulting in the
following output:

/etc/resolv.conf

String Slices
String slices are Python's flexible built-in way of extracting substrings from
larger strings. A slice can trim characters from the beginning or the end of a
string, or cut a piece out of the middle, and each position can be counted from
either the beginning or the end. When you're using string slices, remember that
the first character is element 0. Perhaps the simplest string slice is grabbing
a single character (this could also be considered subscripting):

a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print a[2]

The preceding prints element 2, which is the third character,
C. Next, consider the following code, which prints
the first five characters of the string:

a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print a[0:5]

The preceding prints ABCDE. Often
you'll want to remove a certain number of characters from the end of a
string. Most often this is done to remove a newline from the end of the string.
Using a zero before the colon and a negative number after it trims off the
number of characters equal to the absolute value of the negative number from the
end of the string:

a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print a[0:-1]

The preceding code trims the Z off
the end of the string. Often you need just a portion from the middle of the
string:

a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print a[2:5]

The preceding code prints elements 2, 3, and 4, which make up the string CDE.
The best way to remember this is that the number of elements in the slice is
the difference between the second and first numbers. The exception is when the
second number goes past the end of the string, in which case the slice simply
runs from the element corresponding to the first number through the end of the
string. For instance:

a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print a[2:1000]

The preceding prints
CDEFGHIJKLMNOPQRSTUVWXYZ. It strips off elements 0
and 1, printing element 2 (C) as the first
character. To print the last three characters of the string, you can combine a
huge second number with a negative first number:

a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print a[-3:1000]

The preceding code prints the last three characters,
XYZ. If you find it unappealing to provide an
arbitrarily large number, you can use the len()
function to get the exact string length and accomplish the same thing, as
illustrated in the following example:

a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print a[-3:len(a)]

Once again, the output is XYZ. Occasionally you may wish to extract a substring
that stops just short of the end. In that case, both numbers are negative. For
instance, to get the fourth-from-last, third-from-last, and second-from-last
characters of the string:

a="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print a[-4:-1]

The preceding code prints WXY. Remember that the number of characters returned
is the difference between the numbers, assuming the original string contains
enough characters.

String slicing is a versatile tool that's built into the Python language. There
are more uses of string slicing, but the preceding discussion will give you a
strong foundation.
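
As a compact recap, the sketch below collects the slice forms discussed above and also shows that either bound may simply be omitted, which is standard Python behavior and avoids both the arbitrary 1000 and the call to len(). The variable names are illustrative only, and print() is written as a function call so the snippet also runs on current Python releases.

```python
a = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

first_five = a[:5]      # same as a[0:5]
all_but_last = a[:-1]   # same as a[0:-1]
from_third = a[2:]      # same as a[2:1000] or a[2:len(a)]
last_three = a[-3:]     # same as a[-3:len(a)]
near_end = a[-4:-1]     # fourth-, third-, and second-from-last characters

print(first_five)    # ABCDE
print(last_three)    # XYZ
print(near_end)      # WXY
```

In each case the length of the result is the distance between the two positions once the omitted or negative bound has been resolved against the length of the string.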

Converting with str(), repr(), and eval()
Python is more strongly typed than some languages (Perl, for instance), so
strings and numbers can't be mixed without conversion. The str() and repr()
functions convert a number to a string, while the eval() function converts a
string to a number. The difference between str() and repr() is that repr()
attempts to return a string that can be converted back to a number by eval(),
while str() attempts to return a printable string. In practice, both repr() and
str() usually do an excellent job of returning a string that's convertible back
to a number. Consider this example:

pi_string = "3.14159"
r = 2.0
area = eval(pi_string) * r * r
print "Area = " + repr(area)
print "Area = " + str(area)
print "Area =",
print 3.14159 * 2 * 2

The preceding code prints the area of a radius 2 circle three times. The first
two times, it converts a string representation of pi to a number before doing
the calculation, and the third time, it calculates from numerical constants.
The first and second time differ only in that the first uses repr(), while the
second uses str(). All three produce the identical results shown here:

Area = 12.56636
Area = 12.56636
Area = 12.56636
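
The difference between str() and repr() is easier to see on a string value than on a number, so the sketch below uses one. It also shows float() as a narrower alternative to eval() for numeric text. That substitution is an assumption of this sketch rather than something the chapter uses, and it relies on the float() built-in accepting strings, as it does on current Python releases. print() is written as a function call so the snippet runs there too.

```python
pi_string = "3.14159"
r = 2.0

# float() converts only numeric text, whereas eval() would evaluate
# any expression it is handed.
area = float(pi_string) * r * r

label = "Area"
print(str(label))    # Area
print(repr(label))   # 'Area'  (quotes included, so eval() can read it back)

# Round trip: eval() rebuilds the original value from its repr() text.
restored = eval(repr(label))
print(restored == label)   # True
```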

Rearranging a String
String slicing and appending can be used to rearrange strings. Consider the
following function to convert a MM/DD/YYYY date to YYYYMMDD:

def yyyymmdd(d):
    return(d[-4:1000] + d[0:2] + d[3:5])

print yyyymmdd("07/04/1999")
print yyyymmdd("12/25/1999")
print yyyymmdd("01/01/2000")

Function yyyymmdd() returns the last four characters, then the first two, and
then the fourth and fifth (elements 3 and 4) to produce this output:

19990704
19991225
20000101

This simplistic function assumes an exact format for the input date. Later in
this chapter you'll build a "smarter" converter using string and number
conversions and regular expressions.
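
One small step away from fixed positions is to split the date on the slash, so that single-digit months and days are handled too. The sketch below is an illustration only and is not the chapter's later listing. The function name yyyymmdd_split is made up here, and it uses the split() and zfill() string methods found on current Python releases.

```python
def yyyymmdd_split(d):
    """Convert M/D/YYYY or MM/DD/YYYY text to YYYYMMDD (no validation)."""
    month, day, year = d.split("/")
    # zfill(2) pads a one-digit month or day with a leading zero.
    return year + month.zfill(2) + day.zfill(2)

print(yyyymmdd_split("07/04/1999"))   # 19990704
print(yyyymmdd_split("7/4/1999"))     # 19990704
```

Like the slicing version, this still trusts its input: a string without exactly two slashes raises an error instead of returning a date.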

The String Module
Occasionally the previously described string capabilities aren't enough.
Python's string module is used for those cases. Some functions included in the
string module are explained in this section.

atof(), atoi(), and atol() are sophisticated alternatives for the capabilities
yielded by eval(). atoi() and atol() accept an optional base argument, so they
are capable of working with number systems other than decimal.

Various case conversions return copies of their arguments rather than changing
the string in place. capitalize() capitalizes the first character of a string.
capwords() capitalizes the first letter of each word (but has the side effect
of removing redundant whitespace and leading and trailing whitespace). upper()
converts every letter to uppercase. lower() converts every letter to lowercase.
swapcase() converts all lowercase to uppercase, and vice versa.

Functions lstrip(), rstrip(), and strip() return copies of their argument with
whitespace stripped from the left, the right, and both sides, respectively.

split(s[, sep[, maxsplit]]) returns a list of substrings from string s. The
default for sep is one or more contiguous whitespace characters (in other
words, it splits the string into space-delimited words). Used with a sep
argument, this function becomes a powerful aid in parsing delimited data files.
maxsplit defaults to 0, meaning no limit, but if positive it declares the
maximum number of splits, with any remainder becoming the last entry in the
list. The following can split a quote- and comma-delimited record into fields.
Note that the actual parsing is accomplished in two lines:

import string
s="\"Smith\",\"John\",\"developer\""

s = s[1:-1] #strip first and last quote
z = string.split(s,'","') #split by ","

for x in z:                  #print fields
    print x

join(list[, sep]) is the inverse of split(). It joins the list into a single
string, with each item separated by sep if it's used. If sep is not used, it
defaults to a single space. Thus, string.join(string.split(s)) removes extra
whitespace from string s, while string.join(string.split(s),"") removes all
whitespace, and string.join(string.split(s),"|") delimits the formerly
whitespace-delimited words with the pipe character.

find() and rfind() are used to find substrings from the left and right,
respectively. Function count() counts the number of non-overlapping substring
occurrences in the string.

zfill(s, width) left-fills a string with zeros.

There are several more functions in the string module. They can be found in the
module documentation. To use the string module, remember these two
requirements:

The import string command must appear at the program's top.

Each function must be preceded by the word string and a period.
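
On current Python releases most of these operations are also available as methods on the string itself, and most of the string module functions named in this section (split, join, zfill, atoi, and so on) were dropped in Python 3. The sketch below is a translation of the calls discussed above into method form, offered as an aid for readers on a newer interpreter rather than as part of the original chapter. The sample strings are invented for illustration.

```python
import string

s = "  the   quick brown   fox  "

words = s.split()              # string.split(s)
squeezed = " ".join(words)     # string.join(string.split(s))
packed = "".join(words)        # string.join(string.split(s), "")
piped = "|".join(words)        # string.join(string.split(s), "|")

record = '"Smith","John","developer"'
fields = record[1:-1].split('","')   # string.split(s, '","')

padded = "7".zfill(2)          # string.zfill("7", 2)
from_hex = int("ff", 16)       # string.atoi("ff", 16)
title = string.capwords(s)     # capwords() is still a module function

print(squeezed)   # the quick brown fox
print(packed)     # thequickbrownfox
print(piped)      # the|quick|brown|fox
print(fields)     # ['Smith', 'John', 'developer']
print(padded)     # 07
print(from_hex)   # 255
print(title)      # The Quick Brown Fox
```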

Regular Expressions
Regular expressions enable the programmer to complete parsing tasks in a few
lines of code instead of the 20-100 lines required in the C language. Regular
expressions are flexible, wildcard-enabled strings that are used to match, pick
apart, and translate other strings.

Regular Expressions: Python and Perl
The Perl language's success can be attributed partially to its inclusion of
regular expressions. Python also supports regular expressions with the new,
Perl-compatible re module, as well as the obsolete regex module (which you
shouldn't use). The syntax for invoking regular expressions and retrieving
groups is as different as can be between the two languages, but the regular
expressions themselves are identical. So if you can construct a Perl regular
expression, you can do the same in Python. Here is a Perl example for finding
the seven characters before the word "Linux" in a string:

#!/usr/bin/perl

my($a) = "I like Red Hat Linux for development.";
$a =~ m/(.{7})Linux/;
my($b) = $1;                  #group 1
print "<$b>\n";

Here's the same code written in Python:

#!/usr/bin/python
import re

a = "I like Red Hat Linux for development."
m = re.search("(.{7})Linux", a)
b = m.group(1)
print "<" + b + ">"

Each prints the string <ed Hat >. The syntax is completely different, but the
regular expression, in this case (.{7})Linux, remains the same.

The preceding code uses the re module's search function. The re module also
contains a match function, which will not be covered in this chapter.

Note the line import re. This must appear in every module using the re
module's regular expressions.
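
Although match is not covered here, one behavioral difference is worth knowing: re.match() succeeds only when the pattern matches at the very start of the string, while re.search() scans the whole string. A minimal sketch using the sentence from the example above follows. The variable names are illustrative, and print() is called as a function so the snippet also runs on current Python releases.

```python
import re

a = "I like Red Hat Linux for development."

found_by_search = re.search("Linux", a)   # scans the whole string
found_by_match = re.match("Linux", a)     # only tries the start of the string
match_at_start = re.match("I like", a)    # succeeds: the text begins this way

print(found_by_search is not None)   # True
print(found_by_match is not None)    # False
print(match_at_start.group(0))       # I like
```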

Simple Matches
The simplest use of a regular expression is to determine whether a string
conforms to the specified regular expression. Perhaps the simplest means is
searching the string for the existence of a substring:

import re

a = "I like Red Hat Linux for development."
m = re.search("Linux", a)
if m == None:
    print "Not found"
else:
    print "Found"

The function re.search() returns a match object. The match object, which was
assigned to variable m in the preceding code, contains all information
concerning the application of the regular expression against the string. If
nothing in the string matches the regular expression, re.search() returns the
special value None.

There are several wildcards. Here are the most important ones:

.      Any character (except a newline, by default)
^      Beginning of the string
$      End of the string
\s     Whitespace character
\S     Non-whitespace character
\d     Digit character
\D     Non-digit character
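
A few quick checks can make the wildcards concrete. The sketch below applies the standard symbols for the entries above (., ^, $, \s, \S, \d, and \D) to an invented sample string. The patterns are written as raw strings (r"...") so that the backslashes reach the re module unchanged, and print() is called as a function so the snippet also runs on current Python releases.

```python
import re

line = "Room 42 is open"

any_chars = re.search(r"R..m", line).group(0)        # . matches any character
at_start = re.search(r"^Room", line) is not None     # ^ anchors at the beginning
at_end = re.search(r"open$", line) is not None       # $ anchors at the end
not_at_start = re.search(r"^open", line) is None     # "open" is not at the start
digits = re.search(r"\d\d", line).group(0)           # two digit characters
spaced = re.search(r"\s\S\S\s", line).group(0)       # space, two non-spaces, space
non_digits = re.search(r"\D\D\D\D", line).group(0)   # four non-digit characters

print(any_chars)      # Room
print(digits)         # 42
print(repr(spaced))   # ' 42 '
print(non_digits)     # Room
```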

There are several repetition specifiers. These can be placed after a character
or wildcard to indicate repetition of the character or wildcard. Here are the
most important ones:

*        0 or more repetitions
+        1 or more repetitions
?        1 or 0 repetitions
{n}      Exactly n repetitions
{n,}     At least n repetitions
{n,m}    At least n but not more than m repetitions
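
The repetition specifiers can be tried out the same way. The sketch below pairs each one (*, +, ?, {n}, {n,}, and {n,m}) with a tiny invented input. The variable names are illustrative, and print() is called as a function so the snippet also runs on current Python releases.

```python
import re

star = re.search(r"ab*", "a").group(0)               # * allows zero b's
plus_missing = re.search(r"ab+", "a") is None        # + needs at least one b
optional = re.search(r"colou?r", "color").group(0)   # ? makes the u optional
exactly = re.search(r"\d{4}", "in 1999").group(0)    # exactly four digits
at_least = re.search(r"\d{2,}", "7 then 12345").group(0)   # two or more digits
bounded = re.search(r"\d{1,2}", "12345").group(0)    # one or two digits

print(star)           # a
print(plus_missing)   # True
print(optional)       # color
print(exactly)        # 1999
print(at_least)       # 12345
print(bounded)        # 12
```

The last two lines show that a specifier takes as many repetitions as it is allowed: {2,} skips the lone 7 and then consumes all five digits, while {1,2} stops after two.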

Several flags can be used to modify the behavior of the regular expression
search. These flags are numerical constants used as an optional third argument
to the re.search() function. These flags can be ORed using the pipe symbol (|)
to accomplish multiple modifications. By far the most common flag is
re.IGNORECASE, which ignores case during searches. There are several others,
which can be found in Python's documentation.

Here's a comparison of a search with and without re.IGNORECASE:

m = re.search("Linux", a, re.IGNORECASE)
m = re.search("Linux", a)

The first search finds Linux, LINUX, linux, lInUx, and any other upper- and
lowercase combination of Linux. The second search finds only the exact string
Linux.

To demonstrate wildcards and repetitions, here is an overly simple regular
expression to identify whether a date exists in a line:

a = "Valentines is 2/14/2000. Don\'t forget!"
m = re.search("\D\d{1,2}/\d{1,2}/\d{2,4}\D", a)
if m == None:
    print "No date in string."
else:
    print "String contains date."

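The preceding code checks for the existence, anywhere in the string, of a
non-digit followed by one or two digits, followed by a slash, followed by one
or two digits, followed by another slash, followed by two, three, or four
digits, followed by a non-digit. Because a non-digit is required on both sides,
a date at the very beginning or very end of the string is not found, which is
one reason this expression is overly simple.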

Note -
There is an alternative syntax that precompiles the regular expression for
faster use in tight loops. In my experiments, it improved regular expression
performance by roughly 15 percent. It is not covered in this chapter. If you
need better regular expression performance in loops, look up
re.compile() in your Python
documentation.
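
For readers who do look it up, here is a minimal sketch of the precompiled form, combined with two flags ORed together as described earlier. It makes no performance claim of its own. re.MULTILINE, which lets ^ match at the start of every line, is a standard re flag that this chapter does not otherwise mention, and the sample text is invented. print() is called as a function so the snippet also runs on current Python releases.

```python
import re

# Compile once, outside any loop; the two flags are combined with |.
pattern = re.compile(r"^linux", re.IGNORECASE | re.MULTILINE)

# With re.MULTILINE, ^ also matches just after each newline.
text = "Red Hat\nLINUX 6"
m = pattern.search(text)
print(m.group(0))   # LINUX

# The compiled object is reused for every line in the loop.
lines = ["Linux kernel", "GNU tools", "linux distro"]
hits = []
for line in lines:
    if pattern.search(line):
        hits.append(line)
print(hits)   # ['Linux kernel', 'linux distro']
```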

Simple Parsing with Groups
Classifying strings is nice, but the real power comes from the ability to parse
strings. Perhaps the simplest example is changing a file extension. Consider
the following:

src = "myfile.conf"
m = re.search("(\S+)\.conf", src)
dst = m.group(1) + ".bak"
print dst

The preceding code searches the source string for a group of one or more
non-whitespace characters, followed by .conf, and creates a match object that
is assigned to variable m. The group, which is specified by the parentheses
around the \S+, is available as m.group(1), to which .bak is appended to
complete the destination name.

Here's an example of parsing dates. Note that this example is not complete
enough for use in applications:

a = "11/21/1999"
m = re.search("^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})$", a)
print m.groups()
month = m.group(1)
day = m.group(2)
year = m.group(3)
print year, month, day

Carefully consider the search statement in the preceding code sample. It looks
for a string that begins with a one- or two-digit number whose tens place, if
present, is 0 or 1, followed by either a slash or hyphen, followed by another
one- or two-digit number, this one with the tens place being 0, 1, 2, or 3.
This second one- or two-digit number is followed by another slash or hyphen,
followed by a number of two to four digits that ends the string. Note that this
is about as much validation as can be done without integer arithmetic.

Each number in the regular expression is surrounded by parentheses so that each
will be accessible as a group in the match object. The groups are then
evaluated and assigned to month, day, and year. The sample code first prints
the tuple returned by m.groups() and then the rearranged date:

('11', '21', '1999')
1999 11 21

Another way to accomplish the same objective is to use the groups() function to
return a tuple whose elements can be assigned to the three variables in a
single statement, as shown below:

a = "11/21/1999"
m = re.search("^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})", a)
month,day,year = m.groups()
print year, month, day

Note that the assignment of m.groups() to three variables works only if the
number of variables equals the number of elements in the tuple returned by
m.groups(). You know in advance how many elements there will be by the number
of pairs of parentheses inserted in the regular expression. You can also obtain
the number of elements in the tuple with the len() function.
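
Both listings above assume that the search succeeds. When it might not, the match object should be tested before groups() is called, because re.search() returns None on failure. The sketch below wraps that check in a helper. The function name split_date is hypothetical, and print() is called as a function so the snippet also runs on current Python releases.

```python
import re

def split_date(text):
    """Return (month, day, year) as strings, or None if text is not a date."""
    m = re.search(r"^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})$", text)
    if m is None:
        return None   # calling m.groups() here would raise AttributeError
    # Three pairs of parentheses in the pattern give a three-element tuple.
    assert len(m.groups()) == 3
    month, day, year = m.groups()
    return month, day, year

print(split_date("11/21/1999"))   # ('11', '21', '1999')
print(split_date("not a date"))   # None
```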

Regular Expression Example
The following complete program takes a file called test.tst, searches it for
lines containing text inside square brackets, and returns that text minus any
left or right space:

#!/usr/bin/python
import re
infile = open("test.tst", "r")
x = infile.readline()
while x != "":
    x = x[:-1]                          #strip the trailing newline
    #non-greedy (.*?) leaves any trailing spaces to the final \s*
    m = re.search("\[\s*(.*?)\s*\]", x)
    if m:
        print m.group(1)
    x = infile.readline()
infile.close()

The preceding code reads every line of file test.tst, checks it for text
between brackets, and prints it (if such text exists). Since many types of
configuration files use brackets for headers, this can be molded into useful
code.
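
As one way of molding it, the sketch below runs the same kind of bracket search over a list of lines held in memory, so it can be tried without creating test.tst. The sample lines imitate a bracket-headed configuration file and are invented for illustration. The group is written as a non-greedy (.*?) so that spaces before the closing bracket are left out of the result, and print() is called as a function so the snippet also runs on current Python releases.

```python
import re

sample_lines = [
    "[ global ]",
    "workgroup = HOME",
    "[printers]",
    "; comment",
]

sections = []
for line in sample_lines:
    # \s* on each side absorbs the padding inside the brackets.
    m = re.search(r"\[\s*(.*?)\s*\]", line)
    if m:
        sections.append(m.group(1))

print(sections)   # ['global', 'printers']
```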

Strings and Regular Expressions Example
Listing 34.1 illustrates many features of strings and regular expressions. It
repeatedly queries the user to type in dates, evaluating, checking, and
printing those dates until the user types in a single lowercase q.

LISTING 34.1 ex34_1.py Prints Dates in Different Formats

#!/usr/bin/python
###########################################
# Sample Only. Do not use in production.
###########################################

import re        #regular expressions
import string    #string manipulation
import readline  #command line editing

def std(s):
    m=re.search("^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})",s)
    if m:
        #atoi() rather than eval(): eval("08") is a syntax error
        mm = string.atoi(m.group(1))
        dd = string.atoi(m.group(2))
        yyyy = string.atoi(m.group(3))
        if yyyy < 40:
            yyyy = yyyy + 2000
        elif yyyy < 100:
            yyyy = yyyy + 1900
        rv = (yyyy,mm,dd)
    else:
        rv = None
    return(rv)

def mdy(t):
    if t:
        mstring = string.zfill(t[1],2)
        dstring = string.zfill(t[2],2)
        return(mstring + "/" + dstring + "/" + str(t[0]))
    else:
        return("bad date")

def ymd(t):
    if t:
        mstring = string.zfill(t[1],2)
        dstring = string.zfill(t[2],2)
        ystring = string.zfill(t[0],4)
        return(ystring + mstring + dstring)
    else:
        return("bad date")

def printdates(s):
    print std(s)
    print mdy(std(s))
    print ymd(std(s))

def main():
    x = raw_input("Please type a date, q to quit==>")
    while(x != "q"):
        printdates(x)
        print
        x = raw_input("Please type a date, q to quit==>")

main()

Function std() creates a standard ymd tuple from its string argument, returning
None if the string is not a date. Function mdy() formats a standard ymd tuple
as a mm/dd/yyyy string, while function ymd() formats a standard ymd tuple as a
yyyymmdd string. Both mdy() and ymd() return the string bad date if passed
None.
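
The two-digit-year rule inside std() is easy to overlook, so the sketch below isolates it: years below 40 are taken to be in the 2000s and other two-digit years in the 1900s. It also shows the % formatting operator as an alternative to zfill() for building the yyyymmdd string. The names expand_year and ymd_text are hypothetical and do not appear in Listing 34.1, and print() is called as a function so the snippet also runs on current Python releases.

```python
def expand_year(yy):
    """Map a two-digit year onto a four-digit one using a pivot of 40."""
    if yy < 40:
        return yy + 2000     # 0-39 become 2000-2039
    elif yy < 100:
        return yy + 1900     # 40-99 become 1940-1999
    return yy                # three or more digits are left alone

def ymd_text(t):
    """Format a (year, month, day) tuple of integers as yyyymmdd."""
    return "%04d%02d%02d" % t

print(expand_year(5))             # 2005
print(expand_year(99))            # 1999
print(expand_year(1999))          # 1999
print(ymd_text((1999, 7, 4)))     # 19990704
```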
