Creating scalable and repeatable functions to augment historical research
An overview of creating data analysis and visualization Python functions and their potential benefits for analyzing historical records
This blog post has three primary goals:
- Walk through how to build a function intended to help examine primary sources
- Establish the benefit of building repeatable and iterative functions
- Provoke history scholars and researchers to consider using Python as an additional investigatory tool
Getting started with functions
One major potential benefit of using programming to augment historical research is its highly scalable, iterative, and repetitive nature. Building functions is a primary way to write code that can be reused and gradually refined. Python functions enable programmers, researchers, and scholars to engage with a wider range and higher number of sources or inputs than they could consider via traditional research methods.
Functions can be very simple or incredibly complex. Think of functions as blocks of instructions that produce wanted outcomes.
- Start by using the def keyword to declare the function name, add parameters in parentheses, and end the line with a colon: def <function_name>(parameter_1, parameter_2, etc.):
- A parameter/argument is information that is passed into the function when it is called. You can add any number of parameters in the parentheses, separated by commas. "Parameter" is typically the term used when defining a function; when you call the function, you enter arguments.
- Add indented statements that specify what the function should execute, including desired outputs (e.g. print(), show(), or a return statement). There can be multiple outputs.
- Once all instructions have been written, un-indent the line and call the function. The function can also be imported and called from other locations.
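As a minimal sketch of the steps above, a definition and a call might look like the following. The function name describe_source and its parameters are hypothetical, chosen only for illustration.

```python
# Hypothetical example: the names are illustrative, not from this post's project.
def describe_source(title, year):
    # 'title' and 'year' are parameters; the indented lines form the function body.
    summary = title + " (" + str(year) + ")"
    # 'return' hands the result back to whoever called the function.
    return summary


# Un-indented line: call the function with two arguments.
result = describe_source("The Six Voyages of John Baptista Tavernier", 1678)
print(result)
```

Because the function returns its result instead of only printing it, the value can be stored in a variable and reused later.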
Simple Example Function
- Function: function_name(filename)
- Purpose: When called, the function opens the file passed in as the parameter filename, reads the text, stores it as the variable text, and then closes the file
- Parameters:
filename: can be any source file saved in the same environment
#declare the function:
def function_name(filename):
    file = open(filename, 'rt')
    text = file.read()
    file.close()
#call the function:
function_name('file_input_as_parameter.txt')
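One thing to notice about the simple example: it reads the file but does not hand the text back to the caller, so the variable text is gone once the function finishes. A possible variant, sketched here under the hypothetical name read_source and assuming the file is UTF-8 encoded, uses a with block and returns the text:

```python
def read_source(filename):
    # 'with' closes the file automatically, even if reading raises an error.
    # Assumption: the source file is UTF-8 encoded.
    with open(filename, 'rt', encoding='utf-8') as file:
        text = file.read()
    # Return the text so the caller can keep working with it.
    return text
```

It could then be called as text = read_source('file_input_as_parameter.txt').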
Building a function to augment historical research
To augment my historical research, I wanted to leverage data analysis and visualization functionality to investigate gems in text-based records. The following steps outline how I built a gem_frequency_graph function. The function produces a bar chart of how often each gemstone appears in any inputted file. See this post for a more complex discussion of the functions that augmented my analysis of gems and pearls in early modern history.
1. Declare the function and parameters, import any needed libraries/modules
- Function: gem_frequency_graph(filename, title_of_file, author_date)
- Purpose: The function produces a graphic representation of how many times different gems (diamonds, sapphires, pearls, etc.) are mentioned in an inputted text.
- Parameters:
filename: primary source file that the function analyzes for frequency of gem mentions; the file should be saved as a .txt file in the same environment. See this post for more details on digital history sources.
title_of_file: enter the title of the primary source file as a string. This string populates part of the suptitle() call from the matplotlib library, which sets the heading for the bar plot.
author_date: enter bibliographic information (author, editors, translators, dates, etc.) for the primary source file as a string (e.g. 'Jean-Baptiste Tavernier, 1678'). This string populates part of the title() call from the matplotlib library, which sets the sub-heading for the bar plot.
After naming and defining the gem_frequency_graph function, the next step is to import any necessary libraries/modules. (Python convention usually places imports at the top of the file, but they also work inside the function, as here.) This includes importing other functions that you have built previously. For example, I pre-built file_function(), which cleans up my historical source input.
#define your function with clear parameters and import needed libraries
def gem_frequency_graph(filename, title_of_file, author_date):
    from clean_up_text import file_function
    import matplotlib.pyplot as plt
    from nltk.probability import FreqDist
    import pandas as pd
    import seaborn as sns
    from nltk.stem import PorterStemmer
    #instead of continually redoing clean up, call a built function
    #with the argument from new function of (filename)
    text_final = file_function(filename)
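This post does not show file_function itself, so the following is only a guess at what such a clean-up helper might do. It assumes the helper returns a list of lowercase word tokens with punctuation removed. The real clean_up_text module may differ (for example, it may remove stop words or use nltk tokenizers).

```python
import re


def file_function(filename):
    # Hypothetical stand-in for the clean-up helper; not the original code.
    # Assumption: the source file is UTF-8 encoded.
    with open(filename, 'rt', encoding='utf-8') as file:
        text = file.read()
    # Lowercase the text, then keep only runs of the letters a-z as tokens.
    # Digits and punctuation are dropped, and accented characters split words.
    return re.findall(r"[a-z]+", text.lower())
```

Returning a list of words matters here because the next step filters and counts individual tokens.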
2. Add statements that the function should execute
For the gem_frequency_graph function, I wanted to compare how frequently different gems are mentioned in a specific text. I augmented a large (cleaned-up) list, text_final, to ensure that any variations or plural spellings of the different gems would be counted (e.g. that 'pearls' would be included in the final count for pearl). I did equate spinel to ruby, though there are debates around this topic. This decision could easily be changed or even added to the function as its own adjustable parameter.
From there, I leveraged the pandas library to build a frequency distribution table (a pandas Series), which turns the list data into table form. If you added the output command print(all_fdist) to the function, the terminal would show a table with each gem from gem_list and its count for the text you passed to the function.
    #augmenting the list to build our frequency dist function
    text_final = [i.replace('pearls', 'pearl') for i in text_final]
    text_final = [i.replace('jewels', 'jewel') for i in text_final]
    text_final = [i.replace('gemstone', 'gem') for i in text_final]
    text_final = [i.replace('gems', 'gem') for i in text_final]
    text_final = [i.replace('rubies', 'ruby') for i in text_final]
    text_final = [i.replace('spinel', 'ruby') for i in text_final]
    text_final = [i.replace('emeralds', 'emerald') for i in text_final]
    text_final = [i.replace('corals', 'coral') for i in text_final]
    text_final = [i.replace('diamonds', 'diamond') for i in text_final]
    text_final = [i.replace('ambers', 'amber') for i in text_final]
    text_final = [i.replace('sapphires', 'sapphire') for i in text_final]
    text_final = [i.replace('jades', 'jade') for i in text_final]
    text_final = [i.replace('turquoises', 'turquoise') for i in text_final]
    text_final = [i.replace('ivories', 'ivory') for i in text_final]
    text_final = [i.replace('garnets', 'garnet') for i in text_final]
    gems = ['pearl', 'ruby', 'jewel', 'emerald', 'coral', 'gem', 'diamond', 'sapphire', 'turquoise', 'jade', 'amethyst', 'topaz', 'opal', 'ivory', 'amber', 'catseye', 'alexandrite', 'garnet', 'peridot']
    gem_list = [w for w in text_final if w in gems]
    #build frequency distribution table (a pandas Series), sorted from most to least frequent
    all_fdist = FreqDist(gem_list)
    all_fdist = pd.Series(dict(all_fdist)).sort_values(ascending=False)
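The spinel decision mentioned above could be exposed as a parameter. The sketch below shows one way to do that using only the standard library. The names count_gems, PLURALS, GEMS, and spinel_as_ruby are hypothetical, the word lists are shortened, and exact word matching is used instead of the substring replace calls above, so counts may differ slightly from the original code.

```python
from collections import Counter

# Map variant spellings to the form that should be counted (shortened list).
PLURALS = {'pearls': 'pearl', 'rubies': 'ruby', 'diamonds': 'diamond',
           'emeralds': 'emerald', 'sapphires': 'sapphire'}
# Gems to count (shortened list); 'spinel' is only kept when not merged into ruby.
GEMS = {'pearl', 'ruby', 'diamond', 'emerald', 'sapphire', 'spinel'}


def count_gems(tokens, spinel_as_ruby=True):
    counts = Counter()
    for word in tokens:
        # Normalize plurals to their singular form.
        word = PLURALS.get(word, word)
        # Optionally treat spinel as ruby; the caller decides.
        if spinel_as_ruby and word == 'spinel':
            word = 'ruby'
        if word in GEMS:
            counts[word] += 1
    return counts


sample = ['pearls', 'ruby', 'spinel', 'pearl', 'ship']
print(count_gems(sample).most_common())
print(count_gems(sample, spinel_as_ruby=False).most_common())
```

Making the choice a parameter with a default keeps existing calls working while letting a researcher rerun the same text under the other interpretation.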
Then, I used all_fdist as the input to the seaborn bar graph (sns.barplot). This transforms the data into a visual representation of the findings!
    #build graph; establish your color, format, and other specifications
    sns.set()
    #set_palette applies the palette; color_palette only returns it
    sns.set_palette("husl", 8)
    fig, ax = plt.subplots(figsize=(8, 8))
    all_plot = sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)
    sns.despine()
    plt.xticks(rotation=30)
    plt.ylabel('Frequency (Count)', fontsize=8)
    plt.xlabel('Gem', fontsize=8)
    plt.suptitle('Count mentions of various gemstones in ' + title_of_file, fontsize=14)
    plt.title('by ' + author_date, fontsize=10)
    #display the graph
    plt.show()
3. Call the function
The above code (building the gem count list, the frequency table, the chart, etc.) is indented and is therefore part of the instructions built into the gem_frequency_graph function. On a new, unindented line, you can call the function by typing its name and supplying the arguments in parentheses.
#example calling the gem frequency graph function
gem_frequency_graph('jahangir_wheeler_thackston_translation.txt', '<Jahangirnama>',
                    'Emperor Jahangir - Translated by Wheeler Thackston')
The above arguments output the following graphic representation:

Research, Test, Refine, and Iterate
For my inputs, I practiced using two translations of the Jahangirnama and Jean-Baptiste Tavernier's The Six Voyages of John Baptista Tavernier. Additional text files could be run through the same code by saving the files in the same environment and updating the arguments. The example function is by no means perfect. As I continue to expand my Python skills, I can alter and improve my functions. With every change, I can run the same or new historical texts through it to see the data and end results.
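Running several texts through the same function can be sketched as a loop. The helper run_on_sources and the stand-in fake_graph are hypothetical; in practice you would pass gem_frequency_graph itself. The Tavernier file name is a placeholder for illustration.

```python
def run_on_sources(analyze, sources):
    # Call the same analysis function once per source.
    for filename, title_of_file, author_date in sources:
        analyze(filename, title_of_file, author_date)


# Each tuple holds the three arguments the function expects.
sources = [
    ('jahangir_wheeler_thackston_translation.txt', 'Jahangirnama',
     'Emperor Jahangir - Translated by Wheeler Thackston'),
    # Placeholder file name, for illustration only.
    ('tavernier_six_voyages.txt', 'The Six Voyages of John Baptista Tavernier',
     'Jean-Baptiste Tavernier, 1678'),
]


def fake_graph(filename, title_of_file, author_date):
    # Stand-in so the sketch runs without the text files or plotting libraries.
    print('Would chart ' + title_of_file + ' by ' + author_date)


run_on_sources(fake_graph, sources)
```

Keeping the list of sources separate from the function means adding a new text is a one-line change.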
Functions are easily scalable, repeatable, and adaptable. As many programmers or individuals in the technology space know, the clearest path to a better end result is through iteration! After researching or thinking about what you may want to code, try to build it. Iterate on your functions, test them often, improve and tweak as necessary, or start anew if something is not turning out the way you want it to! The possibilities are limitless.
Happy coding!
Bibliography:
Nur al-Din Jahangir (Jahangir, Emperor of Hindustan). The Jahangirnama: Memoirs of Jahangir, Emperor of India. Translated, edited, and annotated by Wheeler M. Thackston. New York: Oxford University Press, 1999.
Tavernier, Jean-Baptiste. The Six Voyages of John Baptista Tavernier. London: Printed for R.L. and M.P., 1678.
