Parsing a directory tree in Bash

Date completed: October 8, 2015
Language: Bash shell script on Ubuntu Linux

Description of work: In Drexel University’s CS265 class, “Advanced Programming Techniques”, the first assignment calls for a script that walks a specified (or the current) directory tree and creates a file “dir.xml” in each directory.

Each directory may or may not have a README file specifying one “index” entry and/or one or more “required” files (colon-delimited). The script must use awk to turn the README file into <index> and <required> sections, and the XML file includes an <other> section only if the directory contains other entries that were not already named in those sections.

Here I outline a few scripting techniques used in this assignment, and following that, the full text of the script is included.

First, a sample dir.xml:

<?xml version="1.0" encoding="ISO-8859-1"?>
<direntry>
	<index>
		<file>index.html</file>
	</index>
	<required>
		<file>file2.1</file>
		<file>file2.2</file>
	</required>
	<other>
		<dir>Data</dir>
		<file>other5</file>
		<file>other4</file>
		<file>other1</file>
		<file>README</file>
	</other>
</direntry>

Rather than including multiple files in the submission, the awk script is embedded at the end of the main script and split off at run time with tail, as such:

awkFile=$(mktemp)
tail -6 "$0" > "$awkFile"

This writes the last 6 lines of the script to a temporary file. The main script ends with an “exit” command so that Bash never tries to run the embedded awk code as shell commands:

exit
#!/bin/awk -f
BEGIN { FS = ":" }
/^index/ { printf "\t<index>\n\t\t<file>%s</file>\n\t</index>\n", $2 }
/^required/ { print "\t<required>"
        for (i=2 ; i<=NF ; i++ ) printf "\t\t<file>%s</file>\n", $i
        print "\t</required>" }

Note that the awk script turns a line beginning with “index” into an <index> section, and loops over the colon-delimited fields of a “required” line to emit one <file> element per entry.
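To see what these rules produce, the sketch below runs the same awk program inline on a made-up README. The sample content and the variable names readme and readme_xml are assumptions for illustration; the real script passes the program with -f instead.

```shell
# Hypothetical README content; the file names are invented for this demo.
readme=$(printf 'index:index.html\nrequired:file2.1:file2.2\n')

# Same rules as the embedded awk script, given inline rather than via -f.
readme_xml=$(printf '%s\n' "$readme" | awk '
BEGIN { FS = ":" }
/^index/ { printf "\t<index>\n\t\t<file>%s</file>\n\t</index>\n", $2 }
/^required/ { print "\t<required>"
        for (i = 2; i <= NF; i++) printf "\t\t<file>%s</file>\n", $i
        print "\t</required>" }
')

# Show the generated <index> and <required> sections.
printf '%s\n' "$readme_xml"
```

The output should match the <index> and <required> parts of the sample dir.xml shown earlier.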

The main script uses a call to find to locate all subfolders in the tree, calling awk with this file to produce the first two sections of the XML file:

# Determine where we are looking
if (( $# < 1 )) ; then curDir="$PWD" ; else curDir="$1" ; fi

# Search current folder for all subfolders, and create XML per specifications
for i in $(find "$curDir" -type d -not -path .)
do
	# Begin creating the XML index
	echo -e '<?xml version="1.0" encoding="ISO-8859-1"?>\n<direntry>' > "$i"/dir.xml

	# If we have a README file, process it to the output.
	cat "$i"/README 2>/dev/null | awk -f "$awkFile" >> "$i"/dir.xml
	if [ "$?" -eq "0" ]; then
		# Since it was successful, build an exclusion list.
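One caveat about the test above: after a pipeline, $? holds the exit status of the last command (awk here), not of cat, so the check can succeed even when there is no README. A minimal sketch of the difference, using a hypothetical empty scratch directory and an explicit -f test as one possible alternative (this is not what the submitted script does):

```shell
# Hypothetical scratch directory that deliberately contains no README.
workdir=$(mktemp -d)

# cat fails (no such file), but awk still reads the empty input and exits 0.
cat "$workdir"/README 2>/dev/null | awk '{ print }'
pipeline_status=$?

# Testing for the file directly answers the question the comment asks.
if [ -f "$workdir"/README ]; then has_readme=yes; else has_readme=no; fi

echo "pipeline status: $pipeline_status, README present: $has_readme"
rmdir "$workdir"
```

In the script this happens to be harmless, because the grep calls that follow simply find nothing when the README is missing.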

Following this, a method is needed to know which files were accounted for in the README (unfortunately, I wasn’t able to figure out a way to do this all with a single call to awk). For this, I build a colon-separated exclusion list of the form :file1:file2:file3: so that the list can be searched for file1 with the pattern *:file1:*. A colon delimiter was chosen because the README file already uses it, so the exclusion list is built by simply pulling those lines out of the file with grep:

	if [ "$?" -eq "0" ]; then
		# Since it was successful, build an exclusion list.
		exclIndex=$(grep "index:" "$i"/README 2>/dev/null)
		# Trim the "index" part
		if [ -n "$exclIndex" ]; then exclIndex="${exclIndex:5}" ; fi
		exclFiles=$(grep "required:" "$i"/README 2>/dev/null)
		# Trim the "required"
		if [ -n "$exclFiles" ]; then exclFiles="${exclFiles:8}" ; fi
		exclFiles="$exclIndex$exclFiles" # Combine the lists
		# Append a colon if necessary:
		if [ -n "$exclFiles" ]; then exclFiles="$exclFiles:" ; fi
	fi
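The exclusion list can be tried out in isolation. The sketch below is an illustration under a few assumptions: the README content is made up, the helper is_excluded is a hypothetical name, the keyword is stripped with prefix removal (${var#index}) instead of the fixed offsets used above, and the membership test uses case, which applies the same *:name:* pattern as the [[ ]] test in the script.

```shell
# Hypothetical scratch directory with a sample README.
workdir=$(mktemp -d)
printf 'index:index.html\nrequired:file2.1:file2.2\n' > "$workdir"/README

exclIndex=$(grep '^index:' "$workdir"/README)
exclFiles=$(grep '^required:' "$workdir"/README)

# Strip the keywords but keep each leading colon, then close the list with a colon.
exclFiles="${exclIndex#index}${exclFiles#required}:"

# Succeeds when the name appears as a whole colon-delimited entry.
is_excluded() {
	case "$exclFiles" in
		*":$1:"*) return 0 ;;
		*) return 1 ;;
	esac
}

report=""
for name in index.html file2.2 other1 file2; do
	if is_excluded "$name"; then report="$report$name=listed " ; else report="$report$name=other " ; fi
done
echo "$exclFiles"
echo "$report"
rm -r "$workdir"
```

The colons on both sides are what keep a partial name such as file2 from matching file2.1 or file2.2.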

In the next part of the processing, the output of ls -o for the directory is reduced with awk to the first character of the permissions (either - or d for a directory) followed by the file name, then reverse-sorted in the C locale so that the sub-directories come first. This all feeds a for loop:

	# Generate the other files XML first
	otherXML=""
	# The following for loop ignores existing dir.xml and the first char indicates directory.
	# All the directories go first.
	for j in $(ls -o "$i" | awk '{if (NR>1 && $0 !~ "dir.xml") print substr($1,1,1)$8 }' | LC_ALL=C sort -r)
	do
		if [ "${j:0:1}" = "d" ]; then  # Check if it's a directory
			otherXML="$otherXML\t\t<dir>${j:1}</dir>\n"
		elif [[ $exclFiles != *":${j:1}:"* ]]; then # Check for exclusion
			otherXML="$otherXML\t\t<file>${j:1}</file>\n"
		fi
	done
	if [ -n "$otherXML" ]; then
		# At this point, the <other> section is needed
		echo -e "\t<other>\n$otherXML\t</other>" >> "$i"/dir.xml
	fi
	echo -e '</direntry>' >> "$i"/dir.xml
done
rm "$awkFile" # Clean up.
exit

This builds an <other> section, which is appended to the dir.xml file only if there are entries in that section.
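Because the loop splits the output of ls on whitespace, a name containing a space would be broken apart. A sketch of one alternative, assuming a shell glob and a -d test in place of ls and awk (the scratch directory, its contents, and the names dirsXML and filesXML are made up; the exclusion check is left out for brevity):

```shell
# Hypothetical scratch directory with one sub-directory and a few files.
workdir=$(mktemp -d)
mkdir "$workdir"/Data
touch "$workdir"/other1 "$workdir"/'my notes.txt' "$workdir"/dir.xml

dirsXML=""
filesXML=""
for path in "$workdir"/*; do
	[ -e "$path" ] || continue          # empty directory: the glob stays literal
	name=${path##*/}                    # strip the leading directory part
	if [ "$name" = "dir.xml" ]; then continue; fi
	if [ -d "$path" ]; then
		dirsXML="$dirsXML\t\t<dir>$name</dir>\n"
	else
		filesXML="$filesXML\t\t<file>$name</file>\n"
	fi
done

# Directories first, then files; %b expands the \t and \n escapes.
otherXML="$dirsXML$filesXML"
printf '%b' "$otherXML"
rm -r "$workdir"
```

Keeping directories and files in separate variables gives the directories-first order without a sort step, although the files then appear in glob order rather than the reverse order the original loop produces.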

The full script follows:

#!/bin/bash
### Assignment 1 - XML Dir
# Derek Yerger 10/8/2015 - Drexel University CS265-003
#
# This script creates an XML index in the passed (or current) directory and
# all sub-directories.

# Generate awk stream processing file (included at the end of this script)
awkFile=$(mktemp)
tail -6 "$0" > "$awkFile"

# Determine where we are looking
if (( $# < 1 )) ; then curDir="$PWD" ; else curDir="$1" ; fi

# Search current folder for all subfolders, and create XML per specifications
for i in $(find "$curDir" -type d -not -path .)
do
	# Begin creating the XML index
	echo -e '<?xml version="1.0" encoding="ISO-8859-1"?>\n<direntry>' > "$i"/dir.xml

	# If we have a README file, process it to the output.
	cat "$i"/README 2>/dev/null | awk -f "$awkFile" >> "$i"/dir.xml
	if [ "$?" -eq "0" ]; then
		# Since it was successful, build an exclusion list.
		exclIndex=$(grep "index:" "$i"/README 2>/dev/null)
		# Trim the "index" part
		if [ -n "$exclIndex" ]; then exclIndex="${exclIndex:5}" ; fi
		exclFiles=$(grep "required:" "$i"/README 2>/dev/null)
		# Trim the "required"
		if [ -n "$exclFiles" ]; then exclFiles="${exclFiles:8}" ; fi
		exclFiles="$exclIndex$exclFiles" # Combine the lists
		# Append a colon if necessary:
		if [ -n "$exclFiles" ]; then exclFiles="$exclFiles:" ; fi
	fi

	# Generate the other files XML first
	otherXML=""
	# The following for loop ignores existing dir.xml and the first char indicates directory.
	# All the directories go first.
	for j in $(ls -o "$i" | awk '{if (NR>1 && $0 !~ "dir.xml") print substr($1,1,1)$8 }' | LC_ALL=C sort -r)
	do
		if [ "${j:0:1}" = "d" ]; then  # Check if it's a directory
			otherXML="$otherXML\t\t<dir>${j:1}</dir>\n"
		elif [[ $exclFiles != *":${j:1}:"* ]]; then # Check for exclusion
			otherXML="$otherXML\t\t<file>${j:1}</file>\n"
		fi
	done
	if [ -n "$otherXML" ]; then
		# At this point, the <other> section is needed
		echo -e "\t<other>\n$otherXML\t</other>" >> "$i"/dir.xml
	fi
	echo -e '</direntry>' >> "$i"/dir.xml
done
rm "$awkFile" # Clean up.
exit
#!/bin/awk -f
BEGIN { FS = ":" }
/^index/ { printf "\t<index>\n\t\t<file>%s</file>\n\t</index>\n", $2 }
/^required/ { print "\t<required>"
        for (i=2 ; i<=NF ; i++ ) printf "\t\t<file>%s</file>\n", $i
        print "\t</required>" }
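
A closing note on the tail -6 trick: the line count has to be kept in step with the embedded awk program by hand. The sketch below shows one possible alternative, assuming a marker line separates the shell code from the payload. The marker name __AWK_BELOW__ and the throwaway host script are made up for this demo, and the embedded awk program is cut down to a single rule.

```shell
# Write a throwaway host script whose awk payload follows a marker line.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
# Locate the marker line, then copy everything after it to a temp file.
start=$(grep -n '^__AWK_BELOW__$' "$0" | cut -d: -f1)
awkFile=$(mktemp)
tail -n +"$((start + 1))" "$0" > "$awkFile"
printf 'index:index.html\n' | awk -f "$awkFile"
rm -f "$awkFile"
exit
__AWK_BELOW__
BEGIN { FS = ":" }
/^index/ { print $2 }
EOF

# Run the host script; it extracts its own payload and applies it.
marker_out=$(bash "$demo")
rm -f "$demo"
echo "$marker_out"
```

With this layout the awk program can grow or shrink without touching the extraction line, at the cost of one extra grep.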