Using Python to trace threads through history

This blog post has two primary goals:

Consider the merits of leveraging programming as an additional research method for analyzing historical sources and writing object-oriented narratives
Share preliminary findings (code, graphical outputs, and results) that supplement an early modern exploration of the pearl

Project Background

As a technology consultant, I learned to leverage programming to query, process, and evaluate complex data. This work galvanized my desire to integrate technical, digital, and computational approaches with my historical practice.

While there are many ways to use coding to supplement the humanities, my knowledge of object- or lens-based histories provided motivation and direction. I decided to focus on evaluating the use of Python to help trace transient threads through history, building upon an object-oriented piece I had already researched on early modern pearls. My understanding of the primary source materials and historical landscape allowed me to quickly consider how programming could augment a historical investigation and to examine how I (and others!) could leverage technical research and analysis methods in future works.

Findings Overview


For this object-oriented investigation, I built scalable Python functions using pandas, nltk, matplotlib, seaborn, and plotly that perform textual analysis, recognize object frequency and placement within an imported file, and produce interactive histograms, scatter plots, and other charts to visualize results. I used functions (blocks of instructions that produce a desired outcome) to create repeatable programs for analyzing text. Each function takes a text file or source via parameters, i.e., values that are passed into the function when it is called. For each function, I've shared the explanation, code, and output below. See this post for details on digital history sources and this post for details on building functions.

Functions (Intentions and Code Results)

1. frequency_all_words_graph(filename, title_of_file, author_date, number_of_words):

Outputs a bar graph depicting the most frequently used words in any inputted text file (saved in the same environment). Stopwords are not counted because the pre-built clean-up function, file_function, removes them.
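The clean-up step is not shown in this post, so the following is only a minimal sketch of what such a function might do, not the actual file_function. The names clean_tokens, file_function_sketch, and SAMPLE_STOPWORDS are hypothetical, and the short stopword set stands in for a fuller list such as the one nltk provides.

```python
import re

# Hypothetical, deliberately short stopword list (an assumption for this sketch);
# a real project might load a fuller list from nltk instead.
SAMPLE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "were", "with"}

def clean_tokens(raw_text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    # Keep runs of letters, allowing internal hyphens (e.g. "mother-of-pearl").
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", raw_text.lower())
    return [t for t in tokens if t not in stopwords]

def file_function_sketch(filename):
    """Open and read a .txt file, then return its cleaned token list."""
    with open(filename, encoding="utf-8") as f:
        return clean_tokens(f.read())

# Invented sample sentence, used only to show the cleaning step.
sample = "The pearls of the Gulf were sold with mother-of-pearl and a Ruby."
print(clean_tokens(sample))
```

The functions in this post all start from a token list of this general shape: lowercase words, in reading order, with the stopwords already removed.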

Parameters:

filename: primary source file that function analyzes for most frequently used word; saved as a .txt file in the same environment and cleaned up through pre-built function. See this post for details on pre-building a function to open, read, and clean up .txt historical file types
title_of_file: enter the title of the primary source file as a string. This string element populates a portion of the title() function from the matplotlib library that sets the heading for the bar-plot
author_date: enter bibliographic information (author, editors, translators, dates, etc.) of the primary source file as a string (e.g., 'Thomas Coryate, 1611'). This string element populates a portion of the title() function from the matplotlib library that sets the sub-heading for the bar-plot
number_of_words: the n in most_common(n), which returns a list of the top n elements from most common to least common. If n is omitted or None, most_common() returns every element in the counter rather than raising an error, so pass an integer here. For this bar-plot function, number_of_words sets how many words appear in the final graph (whether the plot displays the top 20, 50, or 100 most frequently used words)
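To illustrate how the n in most_common(n) behaves, here is a small sketch using collections.Counter from the standard library, which nltk's FreqDist subclasses. The token list is invented for the example.

```python
from collections import Counter

# Invented token list standing in for a cleaned text.
tokens = ["pearl", "ruby", "pearl", "gold", "pearl", "ruby"]
counts = Counter(tokens)

# With n given, only the n most frequent (word, count) pairs come back.
top_two = counts.most_common(2)
print(top_two)      # [('pearl', 3), ('ruby', 2)]

# With n omitted, every element is returned, most frequent first.
everything = counts.most_common()
print(everything)   # [('pearl', 3), ('ruby', 2), ('gold', 1)]
```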

The function allows me to start exploring the text, the author's tone, and the primary topics. The updatable parameters and visual end result make this function a quick way to engage with new source materials. After naming and defining the frequency_all_words_graph function (or any function), import the libraries and modules it needs, including any functions that you have built previously. Python convention usually places imports at the top of a file; placing them inside the function, as I do here, keeps each function self-contained.

My Code:

def frequency_all_words_graph(filename, title_of_file, author_date, number_of_words):
    from clean_up_text_function import file_function
    import matplotlib.pyplot as plt
    from nltk.probability import FreqDist
    import pandas as pd
    import seaborn as sns

    # call pre-built file_function to open, read, and clean up text file
    text_final = file_function(filename)

    # create frequency distribution DataFrame with # of words specified
    text_cnt = FreqDist(text_final)
    common_words = text_cnt.most_common(number_of_words)
    common_words = pd.DataFrame(common_words, columns=['Words', 'Counts'])

    # format seaborn barplot
    # (set_palette applies the palette; color_palette alone only returns it)
    sns.set()
    sns.set_palette("husl", 8)
    plt.figure(figsize=(10, 8))
    sns.barplot(y="Words", x="Counts", data=common_words)
    plt.title('Most Frequent ' + str(number_of_words) +
              ' Words in ' + title_of_file + '\n' + 'by ' + author_date, fontsize=12)
    plt.show()

# to call, unindent the line or start a new file, type the function name, and
# input the parameters in parentheses, example INPUT:
frequency_all_words_graph('coryat_crudities.txt', 'Coryats Crudities',
                          'Thomas Coryate, 1611', 20)

Example Results (with Coryats Crudities, Thomas Coryate, 1611 as the text input):

[Figure: bar plot produced by frequency_all_words_graph]

2. frequency_gem_graph(filename, title_of_file, author_date):

The function produces a graphic representation of how many times different gems (diamonds, sapphires, pearls, etc.) are mentioned in an inputted text.

Parameters:

filename: primary source file that function analyzes for frequency of gem mentions; file should be saved as a .txt file in the same environment
title_of_file: enter the title of the primary source file as a string. This string element populates a portion of the title() function from the matplotlib library that sets the heading for the bar-plot
author_date: enter bibliographic information (author, editors, translators, dates, etc.) of the primary source file as a string (e.g., 'Jean-Baptiste Tavernier, 1678'). This string element populates a portion of the title() function from the matplotlib library that sets the sub-heading for the bar-plot

This function probes whether pearls hold unique relevance in the text and determines which other gems are discussed or mentioned frequently. This program could easily be updated to examine another object (by updating the plural-to-singular replacements applied to text_final and the gems list).
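As a rough sketch of how the counting step might be generalized to another object, the helper below takes the object list as a parameter instead of hard-coding gems. The name count_object_mentions, the variants mapping, and the spice example are all hypothetical, and the sketch uses collections.Counter rather than FreqDist so that it runs without nltk.

```python
from collections import Counter

def count_object_mentions(tokens, variants):
    """Count mentions per object, folding each variant spelling into one label.

    `variants` maps a label to the set of token spellings counted under it.
    """
    # Invert the mapping so each spelling points at its label.
    lookup = {spelling: label
              for label, spellings in variants.items()
              for spelling in spellings}
    return Counter(lookup[t] for t in tokens if t in lookup)

# Hypothetical object list for a different thread (spices instead of gems).
spice_variants = {
    "pepper": {"pepper", "peppers"},
    "clove": {"clove", "cloves"},
    "nutmeg": {"nutmeg", "nutmegs"},
}
tokens = ["cloves", "pepper", "ship", "clove", "nutmeg", "peppers", "cloves"]
spice_counts = count_object_mentions(tokens, spice_variants)
print(spice_counts.most_common())
```

Listing the variants explicitly also covers irregular plurals such as "rubies", which a simple trailing-"s" replacement would miss.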

My Code:

def frequency_gem_graph(filename, title_of_file, author_date):
    from clean_up_text import file_function
    from nltk.probability import FreqDist
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # call pre-built function to clean up filename
    text_final = file_function(filename)

    # augment text_final so output counts both "pearl" and "pearls" as pearl
    text_final = [i.replace('pearls', 'pearl') for i in text_final]
    text_final = [i.replace('emeralds', 'emerald') for i in text_final]
    text_final = [i.replace('diamonds', 'diamond') for i in text_final]
    text_final = [i.replace('sapphires', 'sapphire') for i in text_final]
    # etc... complete with the rest of the gems in the gems list

    # next, make frequency distribution for gems (note this function could
    # be used to analyze other objects by updating the augmentation and list types).
    gems = ['pearl', 'emerald', 'diamond', 'sapphire', 'ruby', 'jewel', 'gem', 'coral',
            'turquoise', 'jade', 'amethyst', 'topaz', 'opal', 'ivory', 'amber',
            'catseye', 'alexandrite', 'garnet', 'peridot', 'mother-of-pearl']
    gem_list = [w for w in text_final if w in gems]
    all_fdist = FreqDist(gem_list)
    all_fdist = pd.Series(dict(all_fdist)).sort_values(ascending=False)

    # format seaborn barplot (create the figure and axes before plotting onto them)
    sns.set()
    sns.set_palette("husl", 8)
    fig, ax = plt.subplots(figsize=(8, 8))
    all_plot = sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)

    # label each bar with its count
    for p in all_plot.patches:
        all_plot.annotate(format(p.get_height(), '.0f'),
                          (p.get_x() + p.get_width() / 2., p.get_height()),
                          ha='center', va='center', fontsize=8,
                          xytext=(0, 9),
                          textcoords='offset points')
    sns.despine()
    plt.xticks(rotation=30)
    plt.ylabel('Frequency (Count)', fontsize=12)
    plt.xlabel('Gem Type', fontsize=12)
    plt.title('Count mention of various gemstones in ' + title_of_file + '\n' + 'by ' + author_date, fontsize=12)
    plt.show()

# example INPUT:
frequency_gem_graph('tavernier_text.txt', 'The Six Voyages of John Baptista Tavernier', 'Jean-Baptiste Tavernier, 1678')

Example Results (with The Six Voyages of John Baptista Tavernier, by Jean-Baptiste Tavernier, 1678 as the text input):

[Figure: bar plot produced by frequency_gem_graph]

3. gem_histogram_graph(filename, title_of_file):

This function produces an interactive histogram and rug plot that discloses where in the text different gems are discussed, if there are notable correlations between gems, and which gems hold an irregular or individual role in the text.

Parameters:

filename: primary source file that function analyzes for frequency of gem mentions; file should be saved as a .txt file in the same environment
title_of_file: enter the title of the primary source file as a string. This string element informs the heading for the graphic output

The output provides a novel way of interacting with and visualizing the historical source: it graphically threads and displays the location and frequency of gems through the text. The plots would be helpful in evaluating new sources and gauging how they refer to gems (or, as with the frequency chart above, any defined list or category of objects).
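The core of this location tracking can be sketched in a few lines. The helper names mention_positions and relative_positions are hypothetical, and the token list is invented. Converting positions to fractions of the text is an addition of mine, not part of the function above; it is one possible way to compare texts of different lengths.

```python
def mention_positions(tokens, spellings):
    """Return the index of every token that matches one of the spellings."""
    return [i for i, token in enumerate(tokens) if token in spellings]

def relative_positions(positions, total_tokens):
    """Convert token indices to fractions of the way through the text (0 to 1)."""
    return [p / total_tokens for p in positions]

# Invented token list standing in for a cleaned text.
tokens = ["pearl", "ship", "ruby", "pearls", "court", "ruby", "gift", "pearl"]

pearl_at = mention_positions(tokens, {"pearl", "pearls"})
print(pearl_at)                                    # [0, 3, 7]
print(relative_positions(pearl_at, len(tokens)))   # [0.0, 0.375, 0.875]
```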

My Code:

def gem_histogram_graph(filename, title_of_file):
    from clean_up_text import file_function
    import pandas as pd
    import plotly.express as px

    good_list_lower = file_function(filename)

    # spellings counted under each gem column (column order sets the color order)
    gem_spellings = {
        'pearls': ['pearl', 'pearls'],
        'rubies': ['ruby', 'rubies', 'spinel'],
        'emeralds': ['emerald', 'emeralds'],
        'sapphire': ['sapphire', 'sapphires'],
        'diamond': ['diamond', 'diamonds'],
    }

    # record the index of every mention of each gem in the tokenized text
    gem_positions = {
        gem: [i for i, word in enumerate(good_list_lower) if word in spellings]
        for gem, spellings in gem_spellings.items()
    }

    # combine the five position lists into one DataFrame; wrapping each list
    # in a Series pads the shorter columns with NaN
    df = pd.DataFrame({gem: pd.Series(positions, dtype='float64')
                       for gem, positions in gem_positions.items()})

    # create & format plotly histogram
    fig = px.histogram(df, opacity=0.8, nbins=35, marginal='rug',
        color_discrete_sequence=["#FFBD00", "#FF5768", "#4EC29D", "#0065a2", "#8376AA"])
    fig.update_layout(title=title_of_file, yaxis_title="Count",
        xaxis_title='Location in Book Histogram (where in the book do the mentions occur)')
    fig.update_traces(opacity=0.80)
    fig.update_xaxes(showticklabels=False)
    fig.show()

# example INPUT:
gem_histogram_graph('jahangir_wheeler_thackston_translation.txt', 'Jahangirnama, Wheeler Thackston Translation')

Example Results (with The Jahangirnama, by Nur al-Din Jahangir (Jahangir Emperor of Hindustan) and translated by Wheeler M. Thackston as the text input):

[Figure: animated capture of the interactive histogram produced by gem_histogram_graph]

4. pearl_sentence_rug(filename, title_of_file):

This function's output, a pearl-mention specific interactive rug plot, reveals how the pearl is described in the inputted text, what its typical context is, and whether there is a specific section of the book that mentions pearls more frequently.

Parameters:

filename: primary source file that function analyzes for pearl mentions; file should be saved as a .txt file in the same environment
title_of_file: enter the title of the primary source file as a string. This string element informs the heading for the graphic output

The rug plot could be helpful in gauging potential new sources and in pointing to where to start a deeper reading or investigation. It provides an alternative, atypical way to engage with the source's subject, sentences, and syntax.
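To show the sentence-matching idea on its own, here is a minimal sketch that pulls out every sentence mentioning a keyword along with the character offset where that sentence begins. The function name sentences_mentioning and the sample text are hypothetical, and the sentence rule (text ending in '.', '!' or '?') is a rough assumption that will mis-split abbreviations.

```python
import re

def sentences_mentioning(text, keyword):
    """Return (character offset, sentence) pairs for sentences containing keyword."""
    results = []
    # Rough rule: a sentence is a run of text ending in '.', '!' or '?'.
    for match in re.finditer(r"[^.!?]+[.!?]", text):
        sentence = match.group().strip()
        # Match the keyword as a whole word, with an optional plural "s".
        if re.search(r"\b" + re.escape(keyword) + r"s?\b", sentence, re.IGNORECASE):
            results.append((match.start(), sentence))
    return results

# Invented sample text, used only to exercise the function.
sample = "The merchant sold two pearls. The ship left port. One pearl was lost."
found = sentences_mentioning(sample, "pearl")
for offset, sentence in found:
    print(offset, sentence)
```

Because each sentence carries its own offset, the sentence and its location stay paired even when one sentence mentions the keyword more than once.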

My Code:

def pearl_sentence_rug(filename, title_of_file):
    from FUN_clean_up_text import file_function
    import re
    import pandas as pd
    import plotly.express as px

    text_final = file_function(filename)

    # read the raw text as well, since sentence matching needs the punctuation
    # (assumes the cleaned token list no longer carries it)
    with open(filename) as f:
        text = f.read()

    # identify index location of pearl mentions within text file
    pearl_index = [i for i, word in enumerate(text_final) if word in ('pearl', 'pearls')]

    # identify pearl sentences (the pattern "pearl" also matches "pearls")
    all_sentences = re.findall(r"[^.]*pearl[^.]*\.", text, flags=re.IGNORECASE)

    # pair sentences with index locations in order of appearance; zip stops at the
    # shorter list, so the pairing is approximate when a sentence has several mentions
    list_tuples = list(zip(all_sentences, pearl_index))

    # create DataFrame with pearl sentence and index location
    df = pd.DataFrame(list_tuples, columns=['Pearl Sentence', 'Index Location'])
    df["Sentence Key Word"] = 'Pearl'

    # create and format plotly graph
    fig = px.scatter(df, y="Sentence Key Word", x="Index Location",
                     hover_data=['Pearl Sentence'], color="Index Location",
                     color_continuous_scale=px.colors.diverging.Temps)

    fig.update_traces(marker_symbol='line-ns-open',
                      marker_line_width=2.5, marker_size=70)
    fig.update_yaxes(showticklabels=False)
    fig.update_layout(yaxis_title="", xaxis_title='Index Location (location of mentions of pearl in inputted work)',
                      font=dict(size=14, color="#1B6262"),
                      title='Sentences containing pearl identified in ' + title_of_file +
                            '.  Hover over to see the sentences!')
    fig.show()

# example INPUT:
pearl_sentence_rug('jahangir_wheeler_thackston_translation.txt', 'The Jahangirnama')

Example Results (with The Jahangirnama, by Nur al-Din Jahangir (Jahangir Emperor of Hindustan) and translated by Wheeler M. Thackston as the text input):

Results Analysis

As a historian, I have looked to employ concentrated lenses to craft transregional and transdisciplinary histories. While researching and writing, I have relied upon traditional history research practices and tools at my disposal--primarily in-depth readings of selected documents and detailed analyses of other early modern materials. Returning to past sources and works has allowed me to consider the benefits of using data analysis, visualization, and other computational methods to augment historical research.

The four Python functions shared in this post only scratch the surface of what could be built to augment an object-oriented or thread-tracing historical research project. My efforts and results are limited because (1) I only explored text files, (2) I used a small sample of sources, (3) my knowledge of available Python libraries and modules is still developing, and (4) my time is limited!

That being said, using code can offer benefits that traditional readings of pre-selected documents cannot provide, such as the ability to:

Rapidly analyze many sources through the same function and evaluate the potential usefulness of sources (while I only explored written texts, the inputs could be other digitized sources such as databases, audio files, or visual files)
Discover larger patterns or correlations between various documents or text corpora
Visually reappraise and engage with sources in a nontraditional manner
Perform additional, objective analysis of a text's central meaning, topics, focus, and sentiment (I have not yet explored the sentiment analysis models of the nltk library, but am excited to learn more about it)
Constantly augment, iterate on, and improve programs, whether self-authored or open source
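The first benefit in the list above, running many sources through the same function, can be sketched as follows. The names top_words and compare_sources are hypothetical, and the two token lists are invented stand-ins for cleaned sources.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent (word, count) pairs in a token list."""
    return Counter(tokens).most_common(n)

def compare_sources(sources, n=3):
    """Apply the same analysis to every source; `sources` maps a title to its tokens."""
    return {title: top_words(tokens, n) for title, tokens in sources.items()}

# Invented token lists standing in for two cleaned primary sources.
sources = {
    "Source A": ["pearl", "ship", "pearl", "king"],
    "Source B": ["ruby", "ruby", "pearl", "court", "ruby"],
}
comparison = compare_sources(sources, n=2)
for title, words in comparison.items():
    print(title, words)
```

Because the analysis is a single function, adding a third or a thirtieth source only means adding another entry to the dictionary.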

There are seemingly limitless avenues for scalable and repeatable explorations into historical source inputs--especially for object-oriented projects. I have only begun to scratch the surface of possibilities and am eagerly growing my programming skills!

Happy Coding!

Please refer to other posts in my blog or external public repositories such as GitHub or the Programming Historian for basic information on python, using programming for text-based analysis, historical source considerations, and other cool articles.

Bibliography:

Coryate, Thomas, and George Coryate. Coryat's Crudities, vol I & II. Glasgow: J. MacLehose and Sons, 1905.

Harper, Charlie. "Visualizing Data with Bokeh and Pandas," Programming Historian 7, 2018. doi.org/10.46430/phen0081

Nur al-Din Jahangir (Jahangir Emperor of Hindustan). The Jahangirnama: Memoirs of Jahangir. Emperor of India. Translated, edited, and annotated by Wheeler M. Thackston. New York: Oxford University Press, 1999.

Tavernier, Jean-Baptiste. The Six Voyages of John Baptista Tavernier. London: Printed for R.L. and M.P., 1678.
