Using Python to trace threads through history
How to build Python functions with pandas, nltk, seaborn, and plotly to analyze historical sources
This blog post has two primary goals:
- Consider the merits of leveraging programming as an additional research method for analyzing historical sources and writing object-oriented narratives
- Share preliminary findings (code, graphical outputs, and results) that supplement an early modern exploration of the pearl
Project Background
As a technology consultant, I learned to leverage programming to query, process, and evaluate complex data. This work galvanized my desire to integrate technical, digital, and computational approaches with my historical practice.
While there are many ways to use coding to supplement the humanities, my knowledge of object- or lens-based histories provided motivation and direction. I decided to evaluate how Python can help trace transient threads through history, building on an object-oriented piece I had already researched on early modern pearls. My understanding of the primary source materials and historical landscape allowed me to quickly consider how programming could augment a historical investigation and how I (and others!) could leverage technical research and analysis methods in future works.
Findings Overview

For this object-oriented investigation, I built scalable Python functions using pandas, nltk, matplotlib, seaborn, and plotly that perform textual analysis, recognize object frequency and placement within an imported file, and produce interactive histograms, scatter plots, and other charts to visualize results.
I used functions, reusable blocks of instructions that produce a desired outcome, to create repeatable programs for analyzing text. Each function takes in a text file or source via parameters, i.e., values that are passed into the function when it is called. For each function, I've shared the explanation, code, and output below. See this post for details on digital history sources and this post for details on building functions.
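As a minimal sketch of this idea (the function name count_term and the sample tokens are hypothetical and not part of the project code), a function receives its inputs as parameters and returns a result each time it is called:

```python
def count_term(words, term):
    """Return how many times `term` appears in the list `words`."""
    return sum(1 for word in words if word == term)


# hypothetical sample tokens, for illustration only
sample = ['a', 'pearl', 'and', 'a', 'ruby', 'and', 'a', 'pearl']

# the same function can be reused with different arguments
print(count_term(sample, 'pearl'))
print(count_term(sample, 'ruby'))
```

Because the term is a parameter rather than a fixed value, the same few lines can be pointed at any word or any source.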
Functions (Intentions and Code Results)
1. frequency_all_words_graph(filename, title_of_file, author_date, number_of_words):
Outputs a bar graph depicting the most frequently used words in any inputted text file (saved in the same environment). Stopwords are not counted, as they are removed by the pre-built clean-up function, file_function.
Parameters:
- filename: the primary source file that the function analyzes for its most frequently used words; saved as a .txt file in the same environment and cleaned up through a pre-built function. See this post for details on pre-building a function to open, read, and clean up .txt historical file types
- title_of_file: enter the title of the primary source file as a string. This string populates a portion of the title() function from the matplotlib library, which sets the heading for the bar plot
- author_date: enter bibliographic information (author, editors, translators, dates, etc.) of the primary source file as a string (e.g., 'Thomas Coryate, 1611'). This string populates a portion of the title() function from the matplotlib library, which sets the sub-heading for the bar plot
- number_of_words: the n in most_common([n]), which returns a list of the top n elements from most common to least common. If n is omitted or None, most_common() returns all elements in the counter, which would crowd every distinct word into the plot. For this bar-plot function, number_of_words sets how many words appear in the final graph (whether the plot displays the top 20, 50, or 100 most frequently used words)
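To illustrate how the n in most_common() behaves, here is a small sketch. It uses Python's built-in collections.Counter, which nltk's FreqDist is built on, so that it runs without nltk installed; the token list is hypothetical.

```python
from collections import Counter

# hypothetical tokens standing in for a cleaned-up source text
tokens = ['pearl', 'city', 'pearl', 'church', 'city', 'pearl']
counts = Counter(tokens)

# the two most frequent (word, count) pairs
top_two = counts.most_common(2)
# n omitted: every distinct word, most frequent first
everything = counts.most_common()

print(top_two)
print(len(everything))
```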
This function allows me to start exploring the text, the author's tone, and the primary topics. The updatable parameters and visual end result make it a quick way to engage with new source materials. After naming and defining the frequency_all_words_graph function (or any function), the next step is to import any necessary libraries/modules. This includes importing other functions that you have built previously.
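The pre-built clean-up function is covered in a separate post, so the following is only a hypothetical stand-in that shows the general shape such a function might take. The tiny stopword set and the tokenizing rule are assumptions made to keep the sketch self-contained; they are not the project's actual implementation.

```python
import re

# Assumption: a tiny hand-written stopword set keeps this sketch self-contained;
# a fuller list (for example nltk's English stopwords) could be swapped in.
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'in', 'to', 'is', 'was', 'with'}


def file_function(filename):
    """Open a .txt file and return its lowercase words without stopwords."""
    with open(filename, encoding='utf-8') as source:
        raw_text = source.read()
    # keep runs of letters (plus internal apostrophes or hyphens) as words
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", raw_text.lower())
    return [word for word in words if word not in STOPWORDS]
```

Saved in its own module, a function like this can then be imported at the start of each analysis function.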
My Code:
def frequency_all_words_graph(filename, title_of_file, author_date, number_of_words):
    from clean_up_text_function import file_function
    import matplotlib.pyplot as plt
    from nltk.probability import FreqDist
    import pandas as pd
    import seaborn as sns

    # call pre-built file_function to open, read, and clean up text file
    text_final = file_function(filename)

    # create frequency distribution DataFrame with # of words specified
    text_cnt = FreqDist(text_final)
    common_words = text_cnt.most_common(number_of_words)
    common_words = pd.DataFrame(common_words, columns=['Words', 'Counts'])

    # format seaborn barplot
    sns.set()
    # set_palette applies the palette; color_palette would only return it
    sns.set_palette("husl", 8)
    plt.figure(figsize=(10, 8))
    sns.barplot(y="Words", x="Counts", data=common_words)
    plt.title('Most Frequent ' + str(number_of_words) +
              ' Words in ' + title_of_file + '\n' + 'by ' + author_date, fontsize=12)
    plt.show()

# to call, unindent the line or start a new cell, type the function name, and
# input the parameters in parentheses, example INPUT:
frequency_all_words_graph('coryat_crudities.txt', 'Coryats Crudities',
                          'Thomas Coryate, 1611', 20)
Example Results (with Coryats Crudities, Thomas Coryate, 1611 as the text input):

2. frequency_gem_graph(filename, title_of_file, author_date):
The function produces a graphic representation of how many times different gems (diamonds, sapphires, pearls, etc.) are mentioned in an inputted text.
Parameters:
- filename: primary source file that the function analyzes for frequency of gem mentions; file should be saved as a .txt file in the same environment
- title_of_file: enter the title of the primary source file as a string. This string populates a portion of the title() function from the matplotlib library, which sets the heading for the bar plot
- author_date: enter bibliographic information (author, editors, translators, dates, etc.) of the primary source file as a string (e.g., 'Jean-Baptiste Tavernier, 1678'). This string populates a portion of the title() function from the matplotlib library, which sets the sub-heading for the bar plot
This function probes whether pearls hold unique relevance in the text and determines which other gems are discussed and/or mentioned frequently. This program could easily be updated to examine another object (by updating the gem-centric text_final replacements and the gems list).
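One possible way to make the function easier to point at another object is to pass the plural-to-singular mapping and the object list in as parameters. The sketch below is an illustration under that assumption: count_objects, SINGULAR, and OBJECTS are hypothetical names, and it matches whole tokens with a dictionary rather than using the substring replace() calls in my code.

```python
from collections import Counter

# Assumption: hypothetical plural-to-singular mapping and object list;
# swap in other terms to trace a different object
SINGULAR = {'pearls': 'pearl', 'rubies': 'ruby', 'diamonds': 'diamond'}
OBJECTS = {'pearl', 'ruby', 'diamond'}


def count_objects(tokens, singular=SINGULAR, objects=OBJECTS):
    """Count mentions of each tracked object, merging plurals into singulars."""
    normalized = [singular.get(token, token) for token in tokens]
    return Counter(token for token in normalized if token in objects)


# hypothetical tokens, for illustration only
sample_tokens = ['pearls', 'great', 'ruby', 'pearl', 'rubies', 'king', 'diamond']
print(count_objects(sample_tokens).most_common())
```

Calling count_objects(tokens, {'silks': 'silk'}, {'silk'}) would, for example, trace silk instead of gems without touching the function body.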
My Code:
def frequency_gem_graph(filename, title_of_file, author_date):
    from clean_up_text import file_function
    from nltk.probability import FreqDist
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # call pre-built function to clean up filename
    text_final = file_function(filename)

    # augment text_final so output considers both "pearl" and "pearls" = pearl
    text_final = [i.replace('pearls', 'pearl') for i in text_final]
    text_final = [i.replace('emeralds', 'emerald') for i in text_final]
    text_final = [i.replace('diamonds', 'diamond') for i in text_final]
    text_final = [i.replace('sapphires', 'sapphire') for i in text_final]
    # etc... complete with the rest of the gems in the gems list

    # next, make frequency distribution for gems (note this function could
    # be used to analyze other objects by updating the augmentation and list types).
    gems = ['pearl', 'emerald', 'diamond', 'sapphire', 'ruby', 'jewel', 'gem', 'coral',
            'turquoise', 'jade', 'amethyst', 'topaz', 'opal', 'ivory', 'amber',
            'catseye', 'alexandrite', 'garnet', 'peridot', 'mother-of-pearl']
    gem_list = [w for w in text_final if w in gems]
    all_fdist = FreqDist(gem_list)
    all_fdist = pd.Series(dict(all_fdist)).sort_values(ascending=False)

    # format seaborn barplot (the figure and axes must exist before plotting onto them)
    sns.set()
    sns.set_palette("husl", 8)
    fig, ax = plt.subplots(figsize=(8, 8))
    all_plot = sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)
    # label each bar with its count
    for p in all_plot.patches:
        all_plot.annotate(format(p.get_height(), '.0f'),
                          (p.get_x() + p.get_width() / 2., p.get_height()),
                          ha='center', va='center', fontsize=8,
                          xytext=(0, 9),
                          textcoords='offset points')
    sns.despine()
    plt.xticks(rotation=30)
    plt.ylabel('Frequency (Count)', fontsize=12)
    plt.xlabel('Gem Type', fontsize=12)
    plt.title('Count mention of various gemstones in ' + title_of_file + '\n' + 'by ' + author_date, fontsize=12)
    plt.show()

# example INPUT:
frequency_gem_graph('tavernier_text.txt', 'The Six Voyages of John Baptista Tavernier', 'Jean-Baptiste Tavernier, 1678')
Example Results (with The Six Voyages of John Baptista Tavernier, by Jean-Baptiste Tavernier, 1678 as the text input):

3. gem_histogram_graph(filename, title_of_file):
This function produces an interactive histogram and rug plot that discloses where in the text different gems are discussed, if there are notable correlations between gems, and which gems hold an irregular or individual role in the text.
Parameters:
- filename: primary source file that the function analyzes for frequency of gem mentions; file should be saved as a .txt file in the same environment
- title_of_file: enter the title of the primary source file as a string. This string informs the heading for the graphic output
The output provides a novel way of interacting with and visualizing the historical source -- graphically threading and displaying the location and frequency of gems through the text. The produced plots would be helpful in evaluating new sources and gauging how they refer to gems (or, similar to the frequency chart above, any specified list or category of objects).
My Code:
def gem_histogram_graph(filename, title_of_file):
    from clean_up_text import file_function
    import pandas as pd
    import plotly.express as px

    good_list_lower = file_function(filename)

    # create gem specific lists of index positions from tokenized inputted filename
    pearl = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'pearls' or good_list_lower[i] == 'pearl':
            pearl.append(i)
        i = i + 1

    diamond = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'diamond' or good_list_lower[i] == 'diamonds':
            diamond.append(i)
        i = i + 1

    ruby = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'ruby' or good_list_lower[i] == 'rubies' or good_list_lower[i] == 'spinel':
            ruby.append(i)
        i = i + 1

    emerald = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'emerald' or good_list_lower[i] == 'emeralds':
            emerald.append(i)
        i = i + 1

    sapphire = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'sapphire' or good_list_lower[i] == 'sapphires':
            sapphire.append(i)
        i = i + 1

    # create and merge the five gem specific DataFrames
    # (an outer merge keeps every row even though the lists differ in length)
    dfp = pd.DataFrame({'pearls': pearl})
    dfr = pd.DataFrame({'rubies': ruby})
    dfe = pd.DataFrame({'emeralds': emerald})
    dfs = pd.DataFrame({'sapphire': sapphire})
    dfd = pd.DataFrame({'diamond': diamond})
    df1 = dfp.merge(dfr, left_index=True, right_index=True, how='outer')
    df2 = df1.merge(dfe, left_index=True, right_index=True, how='outer')
    df3 = df2.merge(dfs, left_index=True, right_index=True, how='outer')
    df = df3.merge(dfd, left_index=True, right_index=True, how='outer')

    # create & format plotly histogram
    fig = px.histogram(df, opacity=0.8, nbins=35, marginal='rug',
                       color_discrete_sequence=["#FFBD00", "#FF5768", "#4EC29D", "#0065a2", "#8376AA"])
    # use the source title as the chart heading
    fig.update_layout(title=title_of_file, yaxis_title="Count",
                      xaxis_title='Location in Book Histogram (where in the book do the mentions occur)')
    fig.update_traces(opacity=0.80)
    fig.update_xaxes(showticklabels=False)
    fig.show()

# example INPUT:
gem_histogram_graph('jahangir_wheeler_thackston_translation.txt', 'Jahangirnama, Wheeler Thackston Translation')
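The five while loops above all follow the same pattern, so they could be condensed into one helper. The following is a sketch of that alternative rather than the code I ran; term_positions and GEM_VARIANTS are hypothetical names, and the sample tokens are invented for illustration.

```python
def term_positions(tokens, variants):
    """Return the index of every token that matches one of `variants`."""
    return [index for index, token in enumerate(tokens) if token in variants]


# the spellings grouped under each gem mirror the conditions in the loops above
GEM_VARIANTS = {
    'pearl': {'pearl', 'pearls'},
    'diamond': {'diamond', 'diamonds'},
    'ruby': {'ruby', 'rubies', 'spinel'},
    'emerald': {'emerald', 'emeralds'},
    'sapphire': {'sapphire', 'sapphires'},
}

# hypothetical tokens, for illustration only
sample_tokens = ['pearl', 'gold', 'rubies', 'pearls', 'spinel', 'emerald']
positions = {gem: term_positions(sample_tokens, variants)
             for gem, variants in GEM_VARIANTS.items()}
print(positions)
```

Adding a sixth gem would then mean adding one entry to the dictionary instead of writing another loop.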
Example Results (with The Jahangirnama, by Nur al-Din Jahangir (Jahangir Emperor of Hindustan) and translated by Wheeler M. Thackston as the text input):

4. pearl_sentence_rug(filename, title_of_file):
This function's output, a pearl-mention specific interactive rug plot, reveals how the pearl is described in the inputted text, what its typical context is, and whether there is a specific section of the book that mentions pearls more frequently.
Parameters:
- filename: primary source file that the function analyzes for pearl mentions; file should be saved as a .txt file in the same environment
- title_of_file: enter the title of the primary source file as a string. This string informs the heading for the graphic output
The rug plot could be helpful in gauging potential new sources and in pointing to where a deeper reading or investigation should start. It provides an alternative, atypical way to engage with the source's subject, sentences, and syntax.
My Code:
def pearl_sentence_rug(filename, title_of_file):
    from FUN_clean_up_text import file_function
    import re
    import pandas as pd
    import plotly.express as px

    text_final = file_function(filename)

    # assumption: read the raw text as well, since the sentence search below
    # needs the punctuation that the clean-up step may remove
    with open(filename) as source:
        text = source.read()

    # identify index location of pearl mentions within text file
    pearl_index = []
    i = 0
    while i < len(text_final):
        if text_final[i] == 'pearls' or text_final[i] == 'pearl':
            pearl_index.append(i)
        i = i + 1

    # identify pearl sentences ("pearls?" matches both "pearl" and "pearls")
    all_sentences = re.findall(r"([^.]*?pearls?[^.]*\.)", text, flags=re.IGNORECASE)
    # pair sentences with token positions in order; zip stops at the shorter list,
    # so the pairing is approximate when a sentence mentions pearl more than once
    list_tuples = list(zip(all_sentences, pearl_index))

    # create DataFrame with pearl sentence and index location
    df = pd.DataFrame(list_tuples, columns=['Pearl Sentence', 'Index Location'])
    df["Sentence Key Word"] = 'Pearl'

    # create and format plotly graph
    fig = px.scatter(df, y="Sentence Key Word", x="Index Location",
                     hover_data=['Pearl Sentence'], color="Index Location",
                     color_continuous_scale=px.colors.diverging.Temps)
    fig.update_traces(marker_symbol='line-ns-open',
                      marker_line_width=2.5, marker_size=70)
    fig.update_yaxes(showticklabels=False)
    fig.update_layout(yaxis_title="",
                      xaxis_title='Index Location (location of mentions of pearl in inputted work)',
                      font=dict(size=14, color="#1B6262"),
                      title='Sentences containing pearl identified in ' + title_of_file +
                            '. Hover over to see the sentences!')
    fig.show()

# example INPUT:
pearl_sentence_rug('jahangir_wheeler_thackston_translation.txt', 'The Jahangirnama')
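Pairing sentences with token positions through zip() can drift when a sentence mentions pearls more than once. One alternative, sketched below under my own assumptions, is to number the sentences themselves and plot that number instead. The helper name pearl_sentences and the sample text are hypothetical, and the split is deliberately naive: it will trip on abbreviations such as "Mr.", so a dedicated sentence tokenizer (for example nltk's sent_tokenize) may be a better fit for real sources.

```python
import re


def pearl_sentences(text):
    """Return (sentence_number, sentence) pairs for sentences mentioning pearl(s)."""
    # naive split: a sentence ends at '.', '!' or '?' followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    pattern = re.compile(r'\bpearls?\b', re.IGNORECASE)
    return [(number, sentence)
            for number, sentence in enumerate(sentences)
            if pattern.search(sentence)]


# hypothetical sample text, for illustration only
sample = ("The king wore a pearl. He gave rubies to the envoy. "
          "Pearls and a pearl necklace were sent as gifts.")
for number, sentence in pearl_sentences(sample):
    print(number, sentence)
```

Each pair can go straight into a DataFrame with the same two columns used above, so the rest of the plotting code would not need to change.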
Example Results (with The Jahangirnama, by Nur al-Din Jahangir (Jahangir Emperor of Hindustan) and translated by Wheeler M. Thackston as the text input):

Results Analysis
As a historian, I have looked to employ concentrated lenses to craft transregional and transdisciplinary histories. While researching and writing, I have relied upon traditional history research practices and tools at my disposal--primarily in-depth readings of selected documents and detailed analyses of other early modern materials. Returning to past sources and works has allowed me to consider the benefits of using data analysis, visualization, and other computational methods to augment historical research.
The four Python functions shared in this blog only hint at the potential of what could be built to augment an object-oriented or thread-tracing historical research project. My efforts and results are limited because (1) I only explored text files, (2) I used a small sample of sources, (3) my knowledge of available Python libraries and modules is still developing, and (4) I had limited time available!
That being said, using code can offer benefits that traditional readings of pre-selected documents cannot provide, such as the ability to:
- Rapidly analyze many sources through the same function and evaluate the potential usefulness of sources (While I only explored written texts, the input sources can be other digitized sources such as databases, audio files, or visual files.)
- Discover larger patterns or correlations between various documents or text corpora
- Visually reappraise and engage with sources in a nontraditional manner
- Perform additional, objective analysis of a text's central meaning, topics, focus, and sentiment (I have not yet explored the sentiment analysis models of the nltk library, but am excited to learn more about them)
- Constantly augment, iterate on, and improve programs that are either self-authored or available open source
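As a rough illustration of the first two points, the same counting logic can be run over several sources in a single loop. This is only a sketch: compare_sources is a hypothetical helper, and the titles and tokens are invented placeholders rather than results from my sources.

```python
from collections import Counter


def compare_sources(sources, terms):
    """Count each tracked term in every source.

    `sources` maps a source title to its list of cleaned-up tokens.
    """
    return {title: Counter(token for token in tokens if token in terms)
            for title, tokens in sources.items()}


# hypothetical titles and tokens, for illustration only
sources = {
    'Source A': ['pearl', 'ship', 'pearl', 'ruby'],
    'Source B': ['diamond', 'court', 'pearl'],
}
for title, counts in compare_sources(sources, {'pearl', 'ruby', 'diamond'}).items():
    print(title, dict(counts))
```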
There are seemingly limitless avenues for scalable and repeatable explorations into historical source inputs--especially for object-oriented projects. I have only begun to scratch the surface of possibilities and am eagerly growing my programming skills!
Happy Coding!
Please refer to other posts in my blog or external public repositories such as GitHub or the Programming Historian for basic information on Python, using programming for text-based analysis, historical source considerations, and other cool articles.
Bibliography:
Coryate, Thomas, and George Coryate. Coryat's Crudities, vol I & II. Glasgow: J. MacLehose and Sons, 1905.
Harper, Charlie. "Visualizing Data with Bokeh and Pandas," Programming Historian 7, 2018. doi.org/10.46430/phen0081
Nur al-Din Jahangir (Jahangir Emperor of Hindustan). The Jahangirnama: Memoirs of Jahangir, Emperor of India. Translated, edited, and annotated by Wheeler M. Thackston. New York: Oxford University Press, 1999.
Tavernier, Jean-Baptiste. The Six Voyages of John Baptista Tavernier. London: Printed for R.L. and M.P., 1678.
