How to Implement the Interpreter Design Pattern for Messy Data in Python

Figure 1. Top 25% of company coop salaries.

The Interpreter pattern reads text using a grammar

It is usually used to evaluate mathematical text such as “32 * 5 + 23”. The pattern is useful because it can perform arbitrary combinations of operations with a relatively simple set of rules.

Figure 2. The Interpreter design pattern.

The pattern defines an AbstractExpression with two kinds of implementations:
  • a TerminalExpression, also known as a LiteralExpression, and
  • a NonTerminalExpression, which may contain references to AbstractExpressions

Analyze the data for manually-reported salaries for coop

Our goal is to rank companies by the salaries reported in a Reddit megathread for University of Waterloo students.

1password: 25/h (1st coop), 32/hr (3rd coop), 42/hr (5th coop?)
Accedo: 24/hr (3rd coop)
⁠Achievers inc: 20-25/hr
⁠ADP: less than 44/hr
AGF: 18.50/hr
Akuna Capital: 65 USD/hour + return flight + corporate housing
Amazon: $7912/mo + 1875 USD/mo stipend + relocation(?)
AMD: 27/hr
American Express: 34.5/hr
⁠Apple: (34/hr + 1300-1500 stipend/month) (37/hr + 1350 stipend for 3A term)
Arctic Wolf: 20% above coop average, 23/hr (1st coop), 34/hr (4th coop)
Athos: 5000 USD/mo
Atolio: 34/hr (3rd coop), 38/hr (5th coop), 42/hr (6th coop)
⁠Autodesk: 24-30/hr
...

The entries come in many formats, including:
  • X/hr
  • X/mo
  • USD X/month
  • X/hr (1st coop), Y/hr (3rd coop), Z/hr (5th coop)
  • etc.

We want to compare salary rankings using CAD / mo

To compare salaries we need the same units. For this example we convert every salary to Canadian Dollars (CAD) per month, with the following simplifications:

  • We take the average for range values (e.g. 20–25/hr → 22.5/hr)
  • We take the average for different coop term pays (“34/hr (3rd coop), 38/hr (5th coop), 42/hr (6th coop)” → 38/hr)
  • The coop average is CAD 30.0/hr
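
The simplifications above are plain arithmetic. A minimal sketch, assuming a 40-hour week and roughly 4.345 weeks per month (the figures used by the constants later in this article); the helper names `average` and `hourly_to_monthly` are hypothetical and exist only for illustration:

```python
# Hypothetical helpers illustrating the simplifications listed above.
HOURS_PER_WEEK = 40        # assumed full-time week
WEEKS_PER_MONTH = 4.34524  # average number of weeks in a month


def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def hourly_to_monthly(rate_per_hour):
    """Convert an hourly rate to a monthly amount in the same currency."""
    return rate_per_hour * HOURS_PER_WEEK * WEEKS_PER_MONTH


range_rate = average([20, 25])     # "20-25/hr" becomes 22.5
term_rate = average([34, 38, 42])  # three coop terms become 38.0

print(range_rate, term_rate)
print(round(hourly_to_monthly(term_rate), 2))  # monthly equivalent of 38/hr
```

Everything is averaged at the hourly level first and converted to a monthly figure only once, at the end.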

Define the AbstractExpression

To tackle the problem, we first define the AbstractExpression as follows:

# expressions.py

class AbstractExpression(object):
    def __init__(self):
        '''
        Returns None.

        __init__: None -> None
        '''
        pass

    def interpret(self):
        '''
        Returns the value of the expression.

        interpret: AbstractExpression -> float
        '''
        return 0

    def __repr__(self):
        '''
        Returns the string representation of the evaluated expression.

        __repr__: AbstractExpression -> str
        '''
        return str(self.interpret())

A TerminalExpression usually refers to a single evaluated number

We define it as the following:

# expressions.py

class LiteralExpression(AbstractExpression):
    def __init__(self, string):
        '''
        Returns None.

        __init__: str -> None
        '''
        self.string = string

    def interpret(self):
        '''
        Returns the value of the expression.

        interpret: LiteralExpression -> float
        '''
        return float(self.string)

# Example:
LiteralExpression("32").interpret() # Returns 32.0

Let’s define an AddExpression

Now that we have the base case, LiteralExpression, let’s add a simple addition operation to our interpreted language:

# expressions.py

class AddExpression(AbstractExpression):
    def __init__(self, left, right):
        '''
        Returns None.

        __init__: AbstractExpression AbstractExpression -> None
        '''
        self.left = left
        self.right = right

    def interpret(self):
        '''
        Returns the value of the expression.

        interpret: AddExpression -> float
        '''
        return self.left.interpret() + self.right.interpret()

# Example:
AddExpression(LiteralExpression("5"), LiteralExpression("6")).interpret() # Returns 11.0

Now we need a SubtractExpression

Almost exactly like the AddExpression, but we perform subtraction in interpret().

# expressions.py

class SubtractExpression(AbstractExpression):
    def __init__(self, left, right):
        '''
        Returns None.

        __init__: AbstractExpression AbstractExpression -> None
        '''
        self.left = left
        self.right = right

    def interpret(self):
        '''
        Returns the value of the expression.

        interpret: SubtractExpression -> float
        '''
        return self.left.interpret() - self.right.interpret()

We also need to define MultiplyExpression

# expressions.py

class MultiplyExpression(AbstractExpression):
    def __init__(self, left, right):
        '''
        Returns None.

        __init__: AbstractExpression AbstractExpression -> None
        '''
        self.left = left
        self.right = right

    def interpret(self):
        '''
        Returns the value of the expression.

        interpret: MultiplyExpression -> float
        '''
        return self.left.interpret() * self.right.interpret()
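
With addition and multiplication available, the opening example “32 * 5 + 23” can be assembled by hand. The sketch below uses condensed stand-ins for the classes above so that it runs on its own; the shortened names `Literal`, `Add` and `Multiply` are hypothetical:

```python
class Literal:
    """Terminal expression: wraps a single number given as text."""

    def __init__(self, text):
        self.text = text

    def interpret(self):
        return float(self.text)


class Add:
    """Non-terminal expression: sum of two sub-expressions."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def interpret(self):
        return self.left.interpret() + self.right.interpret()


class Multiply:
    """Non-terminal expression: product of two sub-expressions."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def interpret(self):
        return self.left.interpret() * self.right.interpret()


# "32 * 5 + 23": multiplication binds tighter, so Add sits at the root.
tree = Add(Multiply(Literal("32"), Literal("5")), Literal("23"))
print(tree.interpret())  # 183.0
```

Calling interpret() on the root walks the whole tree, which is why every expression only needs to know how to combine its direct children.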

We also define a few other Expressions

# expressions.py

class PercentAboveExpression(AbstractExpression):
    '''
    X% above Y --> (Y) * (1 + X / 100)
    '''
    def __init__(self, left, right):
        '''
        Returns None.

        __init__: AbstractExpression AbstractExpression -> None
        '''
        self.left = left
        self.right = right

    def interpret(self):
        '''
        Returns the value of the expression.

        interpret: PercentAboveExpression -> float
        '''
        return (self.right.interpret()) * \
            (1 + self.left.interpret() / 100.0)

# expressions.py

class AverageExpression(AbstractExpression):
    def __init__(self, array):
        '''
        Returns None.

        __init__: (list AbstractExpression) -> None
        '''
        self.array = array

    def interpret(self):
        '''
        Returns the value of the expression.

        interpret: AverageExpression -> float
        '''
        # Skip zeros: unparsed parts evaluate to 0 and should not
        # drag the average down
        sums = list(filter(
            lambda x: x != 0, [x.interpret() for x in self.array]
        ))
        return sum(sums) / len(sums) if (len(sums) > 0) else 0

The two expressions above handle the remaining cases:
  • PercentAboveExpression, to calculate text such as “20% above coop average”
  • AverageExpression, to average ranges such as “20–25/hr” and multi-term entries such as “34/hr (3rd coop), 38/hr (5th coop), 42/hr (6th coop)”
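
To make these two expressions concrete, here is a small self-contained sketch. The classes are condensed, hypothetical stand-ins (`Literal`, `PercentAbove`, `Average`) for the ones defined above, and the coop average of 30.0 is the value assumed earlier in this article:

```python
class Literal:
    """Terminal expression: a single number stored as text."""

    def __init__(self, text):
        self.text = text

    def interpret(self):
        return float(self.text)


class PercentAbove:
    """'X% above Y' evaluates to Y * (1 + X / 100)."""

    def __init__(self, percent, base):
        self.percent = percent
        self.base = base

    def interpret(self):
        return self.base.interpret() * (1 + self.percent.interpret() / 100.0)


class Average:
    """Mean of the non-zero sub-expressions (0 marks 'could not parse')."""

    def __init__(self, items):
        self.items = items

    def interpret(self):
        values = [v for v in (i.interpret() for i in self.items) if v != 0]
        return sum(values) / len(values) if values else 0


coop_average = Literal("30.0")
above = PercentAbove(Literal("20"), coop_average)  # "20% above coop average"
terms = Average([Literal("34"), Literal("38"), Literal("42")])
with_gap = Average([Literal("25"), Literal("0")])  # the 0 is ignored

print(above.interpret(), terms.interpret(), with_gap.interpret())
```

Ignoring zeros matters because fragments the parser cannot understand evaluate to 0, and counting them would pull every average down.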

We now need a parser

Now that we have defined the AbstractExpression and its implementations, we need to write a function that converts an input string to Expression objects.

The conversion takes three steps:

  1. Remove unnecessary text
  2. Convert variations of phrases such as “/year”, “/yr”, and “/y” to the same value
  3. Interpret phrases to convert them into the correct expressions

# constants.py

PER_HOUR = 1

PER_HR_TO_PER_MO = 40 * 4.34524  # 40 hrs a week, 4.34524 wks a month
PER_MO_TO_PER_HR = (  # 1 / (working hours in a month)
    1 / PER_HR_TO_PER_MO
)

PER_YEAR_TO_PER_HR = (  # 1 / (working hours in a year)
    1 / (12 * PER_HR_TO_PER_MO)
)
PER_WEEK_TO_PER_HR = 1 / 40  # 1 / (working hours in a week)
STIPEND_TO_PER_HR = (  # amount spread over a 4-month co-op
    1 / 4 / PER_HR_TO_PER_MO
)

COOP_AVERAGE = {
    "2021": {
        "F": "30.0",
        "S": "30.0"
    }
}

CURRENCY_CONVERTER = {
    "CAD": 1,
    "USD": 1.26,
    "¥": 0.01094
}

INPUT_FOLDER = "inputs"
OUTPUT_FOLDER = "outputs"
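
As a quick check of how these factors fit together, the sketch below normalizes a monthly salary and a lump sum to an hourly rate and then converts the total back to a monthly figure. The three constants repeat the values above so the snippet runs on its own; the 2000 lump sum is a made-up number for illustration:

```python
# Same values as constants.py, repeated so the sketch is self-contained.
PER_HR_TO_PER_MO = 40 * 4.34524
PER_MO_TO_PER_HR = 1 / PER_HR_TO_PER_MO
STIPEND_TO_PER_HR = 1 / 4 / PER_HR_TO_PER_MO

monthly_pay = 7912   # e.g. "7912/mo"
lump_sum = 2000      # hypothetical amount paid once per 4-month term

# Everything is first expressed per hour...
hourly = monthly_pay * PER_MO_TO_PER_HR + lump_sum * STIPEND_TO_PER_HR
# ...and converted to a monthly figure only at the very end.
monthly_total = hourly * PER_HR_TO_PER_MO

print(round(hourly, 2), round(monthly_total, 2))
```

The monthly total comes out as the monthly pay plus a quarter of the lump sum, which is what spreading it over four months should give.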

# helpers.py

def remove_symbols(string):
    '''
    Returns a string with the symbols $, " and ~ removed.

    remove_symbols: Str -> Str
    '''
    string = string.replace("$", "")
    string = string.replace("\"", "")
    string = string.replace("~", "")
    return string

# helpers.py

from constants import CURRENCY_CONVERTER

def convert_currency(string):
    '''
    Returns a string with all currencies converted to CAD.

    convert_currency: Str -> Str
    '''
    # For each key in CURRENCY_CONVERTER, replace with value
    for key in CURRENCY_CONVERTER:
        lowered_key = key.lower()
        uppered_key = key.upper()
        # Special case for yen
        if lowered_key == "¥":
            string = string.replace(
                lowered_key,
                f"{CURRENCY_CONVERTER[uppered_key]} * "
            )
        else:
            # str.replace needs a string, so convert the rate first
            string = string.replace(
                lowered_key,
                str(CURRENCY_CONVERTER[uppered_key])
            )
    return string

# helpers.py

from constants import PER_HOUR, PER_MO_TO_PER_HR, PER_YEAR_TO_PER_HR, \
    PER_WEEK_TO_PER_HR, STIPEND_TO_PER_HR

def fix_variations(string):
    '''
    Returns a string with all variations such as
    /year, /yr, /y replaced with multiplications by per-hour factors.

    fix_variations: Str -> Str
    '''
    # X/year and its variants (longest form first)
    string = string.replace("/year", f" * {PER_YEAR_TO_PER_HR}")
    string = string.replace("/yr", f" * {PER_YEAR_TO_PER_HR}")
    string = string.replace("/y", f" * {PER_YEAR_TO_PER_HR}")
    string = string.replace("annual", f" * {PER_YEAR_TO_PER_HR}")
    string = string.replace("/annum", f" * {PER_YEAR_TO_PER_HR}")
    string = string.replace("/a", f" * {PER_YEAR_TO_PER_HR}")

    # X/hour and its variants (already per hour)
    string = string.replace("/hour", f" * {PER_HOUR}")
    string = string.replace("/hr", f" * {PER_HOUR}")
    string = string.replace("/h", f" * {PER_HOUR}")

    # X/month and its variants
    string = string.replace("stipend/month", f" * {PER_MO_TO_PER_HR}")
    string = string.replace("/month", f" * {PER_MO_TO_PER_HR}")
    string = string.replace("/mo", f" * {PER_MO_TO_PER_HR}")

    # X/week and its variants
    string = string.replace("/week", f" * {PER_WEEK_TO_PER_HR}")
    string = string.replace("/wk", f" * {PER_WEEK_TO_PER_HR}")
    string = string.replace("/w", f" * {PER_WEEK_TO_PER_HR}")

    # Other cases: amounts spread over the whole term
    string = string.replace("relocation", f" * {STIPEND_TO_PER_HR}")
    string = string.replace("stipend", f" * {STIPEND_TO_PER_HR}")
    string = string.replace("signing bonus", f" * {STIPEND_TO_PER_HR}")
    string = string.replace("bonus", f" * {STIPEND_TO_PER_HR}")

    # Drop every remaining "s" character
    string = string.replace("s", "")

    # Handle thousands
    string = string.replace("k", " * 1000")
    return string
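
The order of the replacements above matters: each unit's longest spelling is replaced first so that a shorter one cannot match part of it. A simplified sketch of the same idea, using a hypothetical table `UNIT_FACTORS` and function `normalize_units` that cover only hourly and weekly units:

```python
# Hypothetical, reduced unit table. Longer suffixes come first so that
# "/hr" is consumed before the shorter "/h" can match part of it.
UNIT_FACTORS = [
    ("/hour", 1),
    ("/hr", 1),
    ("/h", 1),
    ("/week", 1 / 40),
    ("/wk", 1 / 40),
    ("/w", 1 / 40),
]


def normalize_units(text, table=UNIT_FACTORS):
    """Rewrite unit suffixes as multiplications by a per-hour factor."""
    for suffix, factor in table:
        text = text.replace(suffix, f" * {factor}")
    return text


print(normalize_units("24/hr"))      # 24 * 1
print(normalize_units("1200/week"))  # 1200 * 0.025

# With the shortest suffix first, "/h" matches inside "/hr" and leaves
# a stray "r" behind.
wrong_order = list(reversed(UNIT_FACTORS))
print(normalize_units("24/hr", wrong_order))  # 24 * 1r
```

After this step a salary is just a multiplication written as text, which the parser can turn into a MultiplyExpression.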

# helpers.py

from constants import COOP_AVERAGE

def get_coop_average(term, year):
    '''
    Returns the average coop salary for a given term
    and year as a string in CAD PER HOUR units.

    get_coop_average: Str Str -> Str
    '''
    return COOP_AVERAGE[str(year)][str(term)]

# salary_parser.py

import re
from expressions import AbstractExpression, AddExpression, \
    SubtractExpression, PercentAboveExpression, AverageExpression, \
    MultiplyExpression, LiteralExpression
from helpers import remove_symbols

def parse_expression(string):
    '''
    Returns an AbstractExpression representing the given string.

    Requires:
    - string is a valid salary expression

    parse_expression: Str -> AbstractExpression
    '''
    string = string.strip()
    # Remove symbols
    string = remove_symbols(string)

    if (string == ""): return AbstractExpression() # FIXME

    # parse any parenthesis if any
    result = re.search(r"(\(.*?\)|\(.*$)", string)
    if (result):
        resultString = result.string[result.start()+1:result.end()-1]
        value = parse_expression(resultString).interpret()
        string = string[:result.start()] + " or " + \
            str(value) + " " + string[result.end():]
        return parse_expression(string)

    result = re.search(r"(\[.*?\]|\[.*$)", string)
    if (result):
        string = string[:result.start()] + string[result.end():]
        return parse_expression(string)

    if ("," in string or " or " in string):
        return AverageExpression([
            parse_expression(s) for s
            in re.split(r",| or ", string)
        ])

    if ("every add" in string): return AbstractExpression()

    if ("% above" in string):
        parts = string.split("% above")
        leftString = parts[0]
        rightString = parts[1]
        return PercentAboveExpression(
            parse_expression(leftString),
            parse_expression(rightString)
        )

    if ("+" in string):
        parts = string.split("+")
        leftString = parts[0]
        rightString = "+".join(parts[1:])
        return AddExpression(
            parse_expression(leftString),
            parse_expression(rightString)
        )

    if ("above" in string):
        parts = string.split("above")
        leftString = parts[0]
        rightString = "above".join(parts[1:])
        return AddExpression(
            parse_expression(leftString),
            parse_expression(rightString)
        )

    if ("below" in string):
        parts = string.split("below")
        leftString = parts[0]
        rightString = "below".join(parts[1:])
        return SubtractExpression(
            parse_expression(rightString),
            parse_expression(leftString)
        )

    if ("*" in string):
        parts = string.split("*")
        leftString = parts[0]
        rightString = "*".join(parts[1:])
        return MultiplyExpression(
            parse_expression(leftString),
            parse_expression(rightString)
        )

    if ("to" in string):
        parts = string.split("to")
        leftString = parts[0]
        rightString = "to".join(parts[1:])
        return AverageExpression([
            parse_expression(leftString),
            parse_expression(rightString)
        ])

    if (" " in string):
        return AverageExpression([
            parse_expression(s) for s
            in string.split(" ")
        ])

    if (string.replace('.','',1).isdigit()):
        return LiteralExpression(string)

    return AbstractExpression()

The parser looks for keywords in the string and builds the matching expression:
  • “% above” to create a PercentAboveExpression,
  • “+” to create an AddExpression,
  • “below” for a SubtractExpression,
  • “*” for a MultiplyExpression,
  • etc.
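
The order of these checks doubles as operator precedence: because “+” is tested before “*”, a string is split at the addition first, so multiplications end up deeper in the tree and are evaluated earlier. A stripped-down sketch of that strategy, assuming input made only of numbers, “+” and “*”, and using plain tuples instead of the expression classes; the names `parse` and `interpret` are hypothetical:

```python
def parse(text):
    """Return a nested tuple tree for text built from numbers, + and *."""
    text = text.strip()
    # Lowest-precedence operator first: it becomes the root of the tree.
    if "+" in text:
        left, right = text.split("+", 1)
        return ("add", parse(left), parse(right))
    if "*" in text:
        left, right = text.split("*", 1)
        return ("mul", parse(left), parse(right))
    # Base case: a plain number.
    return ("lit", float(text))


def interpret(node):
    """Evaluate a tree produced by parse()."""
    kind = node[0]
    if kind == "lit":
        return node[1]
    left, right = interpret(node[1]), interpret(node[2])
    return left + right if kind == "add" else left * right


print(parse("2 + 3 * 4"))
print(interpret(parse("32 * 5 + 23")))  # 183.0
```

Swapping the two checks would make addition bind tighter than multiplication, so the order of the `if` statements is part of the grammar.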

# main.py

import argparse
import os
import re
import pandas as pd
from constants import PER_HR_TO_PER_MO, INPUT_FOLDER, OUTPUT_FOLDER
from helpers import get_coop_average, remove_articles, fix_variations
from salary_parser import parse_expression
from plotter import plot

def main(filename, term, year):
    '''
    Returns None.
    Reads the file in path and gets the average salary for
    each company in the file for the given term and year.
    The function then saves this data to a csv file.
    Also saves a bar chart for the company salaries.

    Effects:
    - Reads the file in path
    - Writes to {OUTPUT_FOLDER}/output.csv and
      {OUTPUT_FOLDER}/output_top_25_percent.csv
    - Writes to {OUTPUT_FOLDER}/output.png and
      {OUTPUT_FOLDER}/output_top_25_percent.png

    Requires:
    - term is F, W, or S

    main: Str Str Str -> None
    '''
    companies = {}

    path = os.path.join(INPUT_FOLDER, filename)

    # Ensure input folder is created
    if not os.path.exists(INPUT_FOLDER):
        os.makedirs(INPUT_FOLDER)

    # Perform the analysis
    with open(path, 'r', encoding="utf8") as f:
        lines = f.readlines()
        for line in lines:
            if (len(line.strip()) == 0): continue
            line_parts = line.split(": ")
            company = line_parts[0]
            # Word joiner character removal
            company = company.replace("\u2060", "")

            salary_string = line_parts[1].replace("\n", "")
            # Drop thousands separators, e.g. 7,912 -> 7912
            salary_string = re.sub(
                r"(\d),(\d{3})",r"\g<1>\g<2>",
                salary_string
            )
            salary_string = salary_string.lower()
            salaries = []

            salary_string_part = salary_string

            salary_string_part = remove_articles(salary_string_part)

            salary_string_part = fix_variations(salary_string_part)

            # str.replace returns a new string, so keep the result
            salary_string_part = salary_string_part.replace(
                "coop average",
                get_coop_average(term, year)
            )

            salary = parse_expression(salary_string_part).interpret()
            if (salary != 0): salaries.append(salary)

            average = (
                sum(salaries) / len(salaries) \
                if (len(salaries) > 0) else 0
            )
            average = average * PER_HR_TO_PER_MO

            companies[company] = {"CAD/mo": average}

    pd.options.display.float_format = "{:,.0f}".format

    # Build the table directly from the dict (one row per company)
    df = pd.DataFrame.from_dict(companies, orient='index')
    df = df.sort_values(by=['CAD/mo'], ascending=False)
    df = df['CAD/mo'].apply(lambda x: int(x))

    # Ensure output folder is created
    if not os.path.exists(OUTPUT_FOLDER):
        os.makedirs(OUTPUT_FOLDER)

    # Save the results
    output_path = os.path.join(OUTPUT_FOLDER, "output.csv")
    df.to_csv(output_path, index_label="Company")

    output_path_top_25 = os.path.join(
        OUTPUT_FOLDER,
        "output_top_25_percent.csv"
    )
    df25 = df.head(int(df.count() * 0.25))
    df25.to_csv(output_path_top_25, index_label="Company")

    # Plot the graph
    plot(df, "output.png")
    plot(df25, "output_top_25_percent.png")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'filename',
        type=str,
        help="File name of input file from input folder"
    )
    parser.add_argument('term', type=str, help="Term (F/W/S)")
    parser.add_argument('year', type=str, help="Year (e.g. 2021)")
    args = parser.parse_args()
    main(args.filename, args.term, args.year)
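
One detail in main() is easy to miss: the parser treats a comma as a separator between salaries to average, so thousands separators have to be removed before parsing. A small sketch of that pre-processing on a made-up input line (the line itself is hypothetical, written in the style of the megathread):

```python
import re

# Hypothetical input line in the same "Company: salary text" format.
line = "Amazon: $7,912/mo + 1,875 USD/mo stipend\n"

# Split the company name from the salary text at the first ": ".
company, salary_text = line.split(": ", 1)

# Join digit groups such as "7,912" into "7912" so the comma is not
# later mistaken for a separator between two salaries.
salary_text = re.sub(r"(\d),(\d{3})", r"\g<1>\g<2>", salary_text.strip())

print(company)              # Amazon
print(salary_text.lower())  # $7912/mo + 1875 usd/mo stipend
```

Lowercasing happens after this step so that the later keyword checks only have to handle one spelling.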

Now for a visualization!

Finally, we plot a visualization using the following:

# plotter.py

import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from constants import OUTPUT_FOLDER

def plot(df, output_filename):
    '''
    Returns None.
    Plots the data in the dataframe df.
    Saves the output to {OUTPUT_FOLDER}/{output_filename}

    Effects:
    - Writes to {OUTPUT_FOLDER}/{output_filename}

    Requires:
    - df is a dataframe with two columns: Company and CAD/mo

    plot: DataFrame Str -> None
    '''
    print(df.reset_index())
    # Plot the data
    sns.set(style="whitegrid")
    fig, ax = plt.subplots(figsize=(20, 0.25 * len(df)))
    g = sns.barplot(
        data=df.reset_index(), y="index", x="CAD/mo",
        ax=ax, palette="blend:limegreen,dodgerblue"
    )
    g.set_title("Average Co-op Salaries")
    g.set_xlabel("CAD/mo")
    g.set_ylabel("Company")

    fig.tight_layout()

    path = os.path.join(OUTPUT_FOLDER, output_filename)
    fig.savefig(path)

Where to find more design patterns

This is only one of the many design patterns that show the power of Object-Oriented Programming. I have also shown how to implement the Memento pattern in another article. If you would like to learn more about Design Patterns, I highly recommend the book Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma et al (The Gang of Four). I would go so far as to call it the bible of programming.

GitHub Repository

Feel free to check out the source code here: https://github.com/justinsj/interpreter-coop-salaries

Justin San Juan
