Encoding History

Using Python to trace threads through history

Natasha — Thu, 28 Oct 2021 06:49:43 GMT

This blog post has two primary goals:

Consider the merits of leveraging programming as an additional research method for analyzing historical sources and writing object-oriented narratives
Share preliminary findings (code, graphical outputs, and results) that supplement an early modern exploration of the pearl

Project Background

As a technology consultant, I learned to leverage programming to query, process, and evaluate complex data. This work galvanized my desire to integrate technical, digital, and computational approaches with my historical practice.

While there are multitudinous ways to use coding to supplement the humanities, my knowledge of object- or lens-based histories provided motivation and direction. I decided to focus on evaluating the use of Python to help trace transient threads through history. To that end, I built upon an object-oriented piece I had already researched on early modern pearls. My understanding of the primary source materials and historical landscape allowed me to quickly consider how programming could augment a historical investigation and examine how I (and others!) could leverage technical research and analysis methods in future works.

Findings Overview

For this object-oriented investigation, I built scalable Python functions using pandas, nltk, matplotlib, seaborn, and plotly that perform textual analysis, recognize object frequency and placement within an imported file, and produce interactive histograms, scatter plots, and other charts to visualize results. I used functions, blocks of instructions that produce a desired outcome, to create repeatable programs for analyzing text. Each function takes a text file or source through its parameters, i.e. the inputs passed into the function when it is called. For each function, I've shared the explanation, code, and output below. See this post for details on digital history sources and this post for details on building functions.

Functions (Intentions and Code Results)

1. frequency_all_words_graph(filename, title_of_file, author_date, number_of_words):

Outputs a bar graph depicting the most frequently used words in any inputted text file (saved in the same environment). Stopwords are not counted because the pre-built clean-up function, file_function, removes them.

Parameters:

filename: primary source file that the function analyzes for the most frequently used words; saved as a .txt file in the same environment and cleaned up through a pre-built function. See this post for details on pre-building a function to open, read, and clean up .txt historical file types
title_of_file: enter the title of the primary source file as a string. This string element populates a portion of the title() function from the matplotlib library that sets the heading for the bar-plot
author_date: enter bibliographic information (author, editors, translators, dates, etc.) of the primary source file as a string (e.g. 'Thomas Coryate, 1611'). This string element populates a portion of the title() function from the matplotlib library that sets the sub-heading for the bar-plot
number_of_words: the n in most_common(n), which returns a list of the top 'n' elements from most common to least common. If n is omitted or None, most_common() returns every element in the counter, which would make the plot unreadably long. For this bar-plot function, number_of_words sets how many words appear in the final graph (whether the plot displays the top 20, 50, or 100 most frequently used words)

The function allows me to start exploring the text, the author's tone, and the primary topics. The updatable parameters and visual end result make this function a quick way to engage with new source materials. After naming and defining the frequency_all_words_graph function (or any function), it is best practice to import any necessary libraries/modules. This includes importing other functions that you have built previously.
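To make the number_of_words parameter concrete, here is a minimal sketch of what most_common(n) returns. It uses collections.Counter from the standard library as a stand-in (nltk's FreqDist is built on Counter and shares this method), and the token list is invented for illustration rather than taken from any source.

```python
from collections import Counter

# Invented sample tokens, standing in for the cleaned list returned by file_function
sample_tokens = ["pearl", "ship", "pearl", "city", "ship", "pearl", "river"]

counts = Counter(sample_tokens)

# most_common(n) returns (word, count) pairs, highest count first
top_two = counts.most_common(2)
print(top_two)

# With no argument, every distinct word is returned
everything = counts.most_common()
print(len(everything))
```

A list of (word, count) pairs like top_two is the shape of data that gets turned into the two-column DataFrame for plotting.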

My Code:

def frequency_all_words_graph(filename, title_of_file, author_date, number_of_words):
    from clean_up_text_function import file_function
    import matplotlib.pyplot as plt
    from nltk.probability import FreqDist
    import pandas as pd
    import seaborn as sns
    from collections import Counter

    # call pre-built file_function to open, read, and clean up text file
    text_final = file_function(filename)

    # create frequency distribution DataFrame with the number of words specified
    text_cnt = FreqDist(text_final)
    common_words = text_cnt.most_common(number_of_words)
    common_words = pd.DataFrame(common_words, columns = ['Words', 'Counts'])

    # format seaborn barplot
    sns.set()
    sns.color_palette("husl", 8)
    plt.figure(figsize=(10,8)) 
    sns.barplot(y= "Words", x = "Counts", data =common_words)
    plt.title('Most Frequent ' + str(number_of_words) + 
             ' Words in ' + title_of_file + '\n' + 'by ' + author_date, fontsize=12)
    plt.show()

# to call, return to an unindented line (or a new file), type the function name,
# and pass the arguments in parentheses. Example input:
frequency_all_words_graph('coryat_crudities.txt', 'Coryats Crudities',
 'Thomas Coryate, 1611', 20)

Example Results (with Coryats Crudities, Thomas Coryate, 1611 as the text input):

2. frequency_gem_graph(filename, title_of_file, author_date):

The function produces a graphic representation of how many times different gems (diamonds, sapphires, pearls, etc.) are mentioned in an inputted text.

Parameters:

filename: primary source file that function analyzes for frequency of gem mentions; file should be saved as a .txt file in the same environment
title_of_file: enter the title of the primary source file as a string. This string element populates a portion of the title() function from the matplotlib library that sets the heading for the bar-plot
author_date: enter bibliographic information (author, editors, translators, dates, etc.) of the primary source file as a string (e.g. 'Jean-Baptiste Tavernier, 1678'). This string element populates a portion of the title() function from the matplotlib library that sets the sub-heading for the bar-plot

This function probes into whether pearls hold unique relevance in the text and determines what other gems are discussed and/or mentioned frequently. This program could easily be updated to examine another object (by augmenting the gem-centric text_final list and DataFrame).
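As a rough sketch of how the same counting idea could be pointed at another object, the example below moves the list of spellings into a parameter. The function name count_object_mentions, the textile vocabulary, and the sample tokens are all hypothetical, and collections.Counter stands in for nltk's FreqDist so the snippet runs without extra libraries.

```python
from collections import Counter

def count_object_mentions(tokens, variants):
    """Count mentions of each object, folding spelling variants together.

    tokens   -- list of lowercase words (e.g. the output of a clean-up function)
    variants -- dict mapping every accepted spelling to its canonical name
    """
    canonical = [variants[t] for t in tokens if t in variants]
    return Counter(canonical)

# Hypothetical variant table for a textile-focused study
textile_variants = {
    "silk": "silk", "silks": "silk",
    "cotton": "cotton", "cottons": "cotton",
    "velvet": "velvet", "velvets": "velvet",
}

# Invented sample tokens
sample_tokens = ["silks", "merchant", "cotton", "silk", "velvet", "port", "silk"]
textile_counts = count_object_mentions(sample_tokens, textile_variants)
print(textile_counts.most_common())
```

Swapping in a gem-focused table (pearl/pearls, ruby/rubies, and so on) would reproduce the counting step of the gem function without touching the function body.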

My Code:

def frequency_gem_graph(filename, title_of_file, author_date):
    from clean_up_text import file_function
    from nltk.probability import FreqDist
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # call pre-built function to clean up filename
    text_final = file_function(filename)

    # normalize text_final so that "pearl" and "pearls" are both counted as pearl
    text_final = [i.replace('pearls', 'pearl') for i in text_final]
    text_final = [i.replace('emeralds', 'emerald') for i in text_final]
    text_final = [i.replace('diamonds', 'diamond') for i in text_final]
    text_final = [i.replace('sapphires', 'sapphire') for i in text_final]
    # ...repeat for the rest of the gems in the gems list

    # next, make frequency distribution for gems (note this function could
    # be used to analyze other objects by updating the normalization and list).
    gems = ['pearl', 'emerald', 'diamond', 'sapphire', 'ruby', 'jewel', 'gem', 'coral',
            'turquoise', 'jade', 'amethyst', 'topaz', 'opal', 'ivory', 'amber',
            'catseye', 'alexandrite', 'garnet', 'peridot', 'mother-of-pearl']
    gem_list = [w for w in text_final if w in gems]
    all_fdist = FreqDist(gem_list)
    all_fdist = pd.Series(dict(all_fdist)).sort_values(ascending=False)

    # format seaborn barplot (create the figure and axes before plotting onto them)
    sns.set()
    sns.color_palette("husl", 8)
    fig, ax = plt.subplots(figsize=(8,8))
    all_plot = sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)
    for p in all_plot.patches:
        all_plot.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', fontsize = 8,
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    sns.despine()
    plt.xticks(rotation=30)
    plt.ylabel('Frequency (Count)', fontsize=12)
    plt.xlabel('Gem Type', fontsize=12)
    plt.title('Count mention of various gemstones in ' + title_of_file + '\n' + 'by ' + author_date, fontsize=12)
    plt.show()

# example INPUT (the name must match the def line above):
frequency_gem_graph('tavernier_text.txt', 'The Six Voyages of John Baptista Tavernier', 'Jean-Baptiste Tavernier, 1678')

Example Results (with The Six Voyages of John Baptista Tavernier, by Jean-Baptiste Tavernier, 1678 as the text input):

3. gem_histogram_graph(filename, title_of_file):

This function produces an interactive histogram and rug plot that discloses where in the text different gems are discussed, if there are notable correlations between gems, and which gems hold an irregular or individual role in the text.

Parameters:

filename: primary source file that function analyzes for frequency of gem mentions; file should be saved as a .txt file in the same environment
title_of_file: enter the title of the primary source file as a string. This string element informs the heading for the graphic output

The output provides a novel way of interacting with and visualizing the historical source -- graphically threading and displaying the location and frequency of gems through the text. The produced plots would be helpful in evaluating new sources and gauging how they refer to gems (or, similar to the frequency chart above, any defined list or category of objects).
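The idea behind the histogram can be sketched without any plotting library: record the position of every gem mention in the token list, then count how many positions fall into each equal-width slice of the text. The helper names, the variant table, and the sample tokens below are assumptions made for illustration, and the binning is a simplified stand-in rather than plotly's own algorithm.

```python
def positions_by_gem(tokens, variants):
    """Return {gem: [token positions]} for each canonical gem name."""
    positions = {}
    for index, token in enumerate(tokens):
        gem = variants.get(token)
        if gem is not None:
            positions.setdefault(gem, []).append(index)
    return positions

def bin_counts(positions, total_tokens, n_bins):
    """Count how many positions fall into each of n_bins equal-width slices."""
    counts = [0] * n_bins
    for p in positions:
        counts[p * n_bins // total_tokens] += 1
    return counts

# Hypothetical variant table and invented sample tokens
gem_variants = {"pearl": "pearl", "pearls": "pearl", "ruby": "ruby", "rubies": "ruby"}
sample_tokens = ["pearls", "king", "ruby", "court", "gift", "pearl", "rubies", "pearl"]

found = positions_by_gem(sample_tokens, gem_variants)
print(found)
print(bin_counts(found["pearl"], len(sample_tokens), 4))
```

Each list of positions plays the role of one column in the merged DataFrame, and each slice corresponds to one histogram bar.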

My Code:

def gem_histogram_graph(filename, title_of_file):
    from clean_up_text import file_function
    import matplotlib.pyplot as plt
    from nltk.probability import FreqDist   
    import pandas as pd
    import plotly.express as px

    good_list_lower = file_function(filename)

    # create gem-specific lists of positions from the tokenized input file
    pearl = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'pearls' or good_list_lower[i] == 'pearl':
            pearl.append(i)
        i = i + 1

    diamond = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'diamond' or good_list_lower[i] == 'diamonds':
            diamond.append(i)
        i = i + 1    

    ruby = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'ruby' or good_list_lower[i] == 'rubies' or good_list_lower[i] == 'spinel':
            ruby.append(i)
        i = i + 1

    emerald = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'emerald' or good_list_lower[i] == 'emeralds':
            emerald.append(i)
        i = i + 1

    sapphire = []
    i = 0
    while i < len(good_list_lower):
        if good_list_lower[i] == 'sapphire' or good_list_lower[i] == 'sapphires':
            sapphire.append(i)
        i = i + 1

    # create and merge the five gem-specific DataFrames
    dfp = pd.DataFrame({'pearls':pearl})
    dfr = pd.DataFrame({'rubies':ruby})
    dfe = pd.DataFrame({'emeralds':emerald})
    dfs = pd.DataFrame ({'sapphire':sapphire})
    dfd = pd.DataFrame ({'diamond':diamond})
    df1 = dfp.merge(dfr, left_index=True, right_index=True, how='outer')
    df2 = df1.merge(dfe, left_index=True, right_index=True, how='outer')
    df3 = df2.merge(dfs, left_index=True, right_index=True, how='outer')
    df = df3.merge(dfd, left_index=True, right_index=True, how='outer')

    # create & format plotly histogram
    fig = px.histogram(df, opacity=0.8, nbins=35, marginal='rug', 
        color_discrete_sequence=["#FFBD00", "#FF5768", "#4EC29D", "#0065a2", "#8376AA"])
    # title_of_file supplies the chart heading
    fig.update_layout(title=title_of_file, yaxis_title="Count", xaxis_title='Location in Book Histogram (where in the book do the mentions occur)') 
    fig.update_traces(opacity=0.80)
    fig.update_xaxes(showticklabels=False)
    fig.show()  

# example INPUT:
gem_histogram_graph('jahangir_wheeler_thackston_translation.txt', 'Jahangirnama, Wheeler Thackston Translation')

Example Results (with The Jahangirnama, by Nur al-Din Jahangir (Jahangir Emperor of Hindustan) and translated by Wheeler M. Thackston as the text input):

4. pearl_sentence_rug(filename, title_of_file):

This function's output, a pearl-mention specific interactive rug plot, reveals how the pearl is described in the inputted text, what its typical context is, and whether there is a specific section of the book that mentions pearls more frequently.

Parameters:

filename: primary source file that the function searches for pearl mentions; file should be saved as a .txt file in the same environment
title_of_file: enter the title of the primary source file as a string. This string element informs the heading for the graphic output

The rug plot could be helpful in gauging potential new sources and in pointing to where a deeper reading or investigation should start. It provides an alternative and atypical way to engage with the source's subject, sentences, and syntax.
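One way to keep each sentence tied to its own location is to collect both in a single pass over the raw text. The sketch below is an illustration under stated assumptions, not the code used for the plot: keyword_sentences is a hypothetical helper, the sample text is invented, and the sentence splitting is deliberately naive (abbreviations such as "Mr." would be cut short).

```python
import re

def keyword_sentences(text, keyword_pattern=r"\bpearls?\b"):
    """Return (character offset, sentence) pairs for sentences matching the pattern."""
    results = []
    # Treat a run of non-terminator characters plus optional end punctuation as one sentence
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = match.group()
        sentence = raw.strip()
        if re.search(keyword_pattern, sentence, flags=re.IGNORECASE):
            # Skip leading whitespace so the offset points at the first character of the sentence
            offset = match.start() + (len(raw) - len(raw.lstrip()))
            results.append((offset, sentence))
    return results

# Invented sample text
sample_text = "The envoy brought two pearls. The court was pleased. A Pearl of great size followed!"
pearl_hits = keyword_sentences(sample_text)
for offset, sentence in pearl_hits:
    print(offset, sentence)
```

Because each offset comes from the same match as its sentence, the two cannot drift out of step, which makes the pairs suitable as the x-position and hover text of a rug plot.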

My Code:

def pearl_sentence_rug(filename, title_of_file):
    from FUN_clean_up_text import file_function
    import re
    import pandas as pd
    import plotly.express as px

    text_final = file_function(filename)

    # file_function returns a cleaned token list, so read the raw text
    # separately; the sentence search below needs the original punctuation
    with open(filename, 'rt') as file:
        text = file.read()

    # identify index location of pearl mentions within the cleaned token list
    pearl_index = []
    i = 0
    while i < len(text_final):
        if text_final[i] == 'pearls' or text_final[i] == 'pearl':
            pearl_index.append(i)
        i = i + 1

    # identify pearl sentences (one pattern covers both "pearl" and "pearls")
    all_sentences = re.findall(r"[^.]*\bpearls?\b[^.]*\.", text, flags=re.IGNORECASE)
    # zip pairs sentences and positions in order and stops at the shorter list,
    # so the pairing is approximate when a sentence mentions pearls more than once
    list_tuples = list(zip(all_sentences, pearl_index))

    # create DataFrame with pearl sentence and index location
    df = pd.DataFrame(list_tuples, columns = ['Pearl Sentence', 'Index Location'])
    df["Sentence Key Word"] = 'Pearl'

    # create and format plotly graph
    fig = px.scatter(df, y="Sentence Key Word", x="Index Location", 
    hover_data=['Pearl Sentence'], color="Index Location", color_continuous_scale=px.colors.diverging.Temps)

    fig.update_traces(marker_symbol='line-ns-open',
                            marker_line_width=2.5, marker_size=70)
    fig.update_yaxes(showticklabels=False)
    fig.update_layout(yaxis_title="", xaxis_title='Index Location(location of mentions of pearl in inputted work)', font=dict(
            size=14,
            color="#1B6262"), title='Sentences containing pearl identified in ' + title_of_file +  
            '.  Hover over to see the sentences!')
    fig.show()

# example INPUT:
pearl_sentence_rug('jahangir_wheeler_thackston_translation.txt', 'The Jahangirnama')

Example Results (with The Jahangirnama, by Nur al-Din Jahangir (Jahangir Emperor of Hindustan) and translated by Wheeler M. Thackston as the text input):

Results Analysis

As a historian, I have looked to employ concentrated lenses to craft transregional and transdisciplinary histories. While researching and writing, I have relied upon traditional history research practices and tools at my disposal--primarily in-depth readings of selected documents and detailed analyses of other early modern materials. Returning to past sources and works has allowed me to consider the benefits of using data analysis, visualization, and other computational methods to augment historical research.

The four python functions shared in this blog only scratch the surface of what could be built to augment an object-oriented or thread-tracing historical research project. My efforts and results are limited because (1) I only explored text files, (2) I used a small sample of sources, (3) my knowledge of available python libraries and modules is still developing, and (4) my available time is finite!

That being said, using code can offer benefits that traditional readings of pre-selected documents cannot provide, such as the ability to:

Rapidly analyze many sources through the same function and evaluate the potential usefulness of sources (While I only explored written texts, the input sources can be other digitized sources such as databases, auditory files, or visual files.)
Discover larger patterns or correlations between various documents or text corpora
Visually reappraise and engage with sources in a nontraditional manner
Perform additional, objective analysis of a text's central meaning, topics, focus, and sentiment (I have not yet explored the sentiment analysis models of the nltk library, but am excited to learn more about them)
Constantly augment, iterate, and improve programs, whether self-authored or available as open source

There are seemingly limitless avenues for scalable and repeatable explorations into historical source inputs--especially for object-oriented projects. I have only begun to scratch the surface of possibilities and am eagerly growing my programming skills!

Happy Coding!

Please refer to other posts in my blog or external public repositories such as GitHub or the Programming Historian for basic information on python, using programming for text-based analysis, historical source considerations, and other cool articles.

Bibliography:

Coryate, Thomas, and George Coryate. Coryat's Crudities, vol I & II. Glasgow: J. MacLehose and Sons, 1905.

Harper, Charlie. "Visualizing Data with Bokeh and Pandas," Programming Historian 7, 2018. doi.org/10.46430/phen0081

Nur al-Din Jahangir (Jahangir Emperor of Hindustan). The Jahangirnama: Memoirs of Jahangir. Emperor of India. Translated, edited, and annotated by Wheeler M. Thackston. New York: Oxford University Press, 1999.

Tavernier, Jean-Baptiste. The Six Voyages of John Baptista Tavernier. London: Printed for R.L. and M.P., 1678.

Creating scalable and repeatable functions to augment historical research

Natasha — Wed, 20 Oct 2021 21:55:44 GMT

This blog post has three primary goals:

Walk through how to build a function intended to help examine primary sources
Establish the benefit of building repeatable and iterative functions
Provoke history scholars and researchers to consider using python as an additional investigatory tool

Getting started with functions

One major potential benefit of using programming to augment historical research is its highly scalable, iterative, and repeatable nature. Building functions is a primary way to write code that can be reused and gradually refined. Python functions enable programmers, researchers, and scholars to engage with a wider range and higher number of sources or inputs than they could consider via traditional research methods.

Functions can be very simple or incredibly complex. Think of functions as blocks of instructions that produce wanted outcomes.

Start by using the def keyword to declare the function name, add parameters in parentheses, and end the line with a colon: def function_name(parameter_1, parameter_2):
- A parameter is a named input that the function receives when it is called. You can add any number of parameters in the parentheses, separated by commas. Parameter is typically the term used when defining a function; the values you enter when calling it are arguments.
Add indented statements that spell out what the function should execute, including desired outputs (e.g. print(), show(), or a return statement). There can be multiple outputs.
Once all instructions have been written, return to an unindented line and call the function. The function can also be imported and called from other files.

Simple Example Function

Function: function_name(filename)
Purpose: When called, the function opens a file inputted as the parameter filename, reads the text, stores it as the variable text, and then closes the file
Parameters:
- filename: can be any source file saved in the same environment

#declare the function:
def function_name(filename):
    file = open(filename, 'rt')
    text = file.read()
    file.close()

#call the function:
function_name('file_input_as_parameter.txt')
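In the example above, the variable text only exists inside the function and disappears once the call finishes. A small variation, sketched below, hands the text back with a return statement so the rest of a program can use it. The name read_source and the throwaway sample file are assumptions made so the snippet runs anywhere.

```python
import os
import tempfile

def read_source(filename):
    """Open a text file, read it, and hand the text back to the caller."""
    # 'with' closes the file automatically, even if reading fails
    with open(filename, "rt", encoding="utf-8") as file:
        text = file.read()
    return text

# Create a small throwaway file so the example runs anywhere
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("A pearl of great price.")
    sample_path = tmp.name

# 'filename' is the parameter; sample_path is the argument passed in this call
source_text = read_source(sample_path)
print(source_text)
os.remove(sample_path)
```

Returning the value, rather than only printing it, is what lets one function feed its result into another.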

Building a function to augment historical research

To augment my historical research, I wanted to leverage data analysis and visualization functionality to investigate gems in text-based records. The following steps outline how I built a gem_frequency_graph function. The function produces a bar chart of how often each gemstone appears in any inputted file. See this post for a more complex discussion regarding functions that augmented my analysis of the gems and pearls in early modern history.

1. Declare the function and parameters, import any needed libraries/modules

Function: gem_frequency_graph(filename, title_of_file, author_date)
Purpose: The function produces a graphic representation of how many times different gems (diamonds, sapphires, pearls, etc.) are mentioned in an inputted text.
Parameters:
- filename: primary source file that the function analyzes for frequency of gem mentions; file should be saved as a .txt file in the same environment. See this post for more details on digital history sources.
- title_of_file: enter the title of the primary source file as a string. This string element populates a portion of the title() function from the matplotlib library that sets the heading for the bar-plot
- author_date: enter bibliographic information (author, editors, translators, dates, etc.) of the primary source file as a string (e.g. 'Jean-Baptiste Tavernier, 1678'). This string element populates a portion of the title() function from the matplotlib library that sets the sub-heading for the bar-plot

After naming and defining the gem_frequency_graph function, it is best practice to import any necessary libraries/modules. This includes importing other functions that you have built previously. For example, I pre-built file_function(), which cleans up my historical source input.

#define your function with clear parameters and import needed libraries
def gem_frequency_graph(filename, title_of_file, author_date):
    from clean_up_text import file_function
    import matplotlib.pyplot as plt
    from nltk.probability import FreqDist
    import pandas as pd
    import seaborn as sns
    from nltk.stem import PorterStemmer

#instead of continually redoing clean up, call a built function 
#and pass it the new function's filename argument 
    text_final = file_function(filename)

2. Add statements that the function should execute

For the gem_frequency_graph function, I wanted to compare how frequently different gems are mentioned in a specific text. I edited the large (cleaned-up) list text_final to ensure that any variations or plural spellings of the different gems would be counted (e.g. that 'pearls' would be counted in the final count for pearl). I did equate spinel to ruby, though there are debates around this topic. This decision could easily be changed or even added to the function as its own adjustable parameter.

From there, I built a frequency distribution and converted it into a pandas Series sorted from most to least frequent. This changes list data into table form. If you added the output command print(all_fdist) to the function, the terminal would show a table listing each gem found in the text you passed to the function alongside its count.

#augmenting the list to build our frequency dist function
    text_final = [i.replace('pearls', 'pearl') for i in text_final]
    text_final = [i.replace('jewels', 'jewel') for i in text_final]
    text_final = [i.replace('gemstone', 'gem') for i in text_final]
    text_final = [i.replace('gems', 'gem') for i in text_final]
    text_final = [i.replace('rubies', 'ruby') for i in text_final]
    text_final = [i.replace('spinel', 'ruby') for i in text_final]
    text_final = [i.replace('emeralds', 'emerald') for i in text_final]
    text_final = [i.replace('corals', 'coral') for i in text_final]
    text_final = [i.replace('diamonds', 'diamond') for i in text_final]
    text_final = [i.replace('ambers', 'amber') for i in text_final]
    text_final = [i.replace('sapphires', 'sapphire') for i in text_final]
    text_final = [i.replace('jades', 'jade') for i in text_final]
    text_final = [i.replace('turquoises', 'turquoise') for i in text_final]
    text_final = [i.replace('ivories', 'ivory') for i in text_final]
    text_final = [i.replace('garnets', 'garnet') for i in text_final]

    gems = ['pearl', 'ruby', 'jewel', 'emerald', 'coral', 'gem', 'diamond', 'sapphire', 'turquoise', 'jade', 'amethyst', 'topaz', 'opal', 'ivory', 'amber', 'catseye', 'alexandrite', 'garnet', 'peridot']
    gem_list = [w for w in text_final if w in gems]

#build frequency distribution dataframe
    all_fdist = FreqDist(gem_list)
    all_fdist = pd.Series(dict(all_fdist)).sort_values(ascending=False)
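The spinel-as-ruby choice mentioned earlier could be exposed as a switch. The sketch below is one hedged way to do that: normalize_gems and its spinel_as_ruby parameter are hypothetical names, the variant table is shortened, and collections.Counter stands in for FreqDist. It also matches whole tokens through a dictionary lookup rather than str.replace, which acts on substrings.

```python
from collections import Counter

def normalize_gems(tokens, spinel_as_ruby=True):
    """Map plural or variant gem spellings to one canonical form, token by token."""
    canonical = {
        "pearls": "pearl", "rubies": "ruby", "emeralds": "emerald",
        "diamonds": "diamond", "gemstone": "gem", "gemstones": "gem", "gems": "gem",
    }
    if spinel_as_ruby:
        # Debated equivalence, so it is switchable rather than hard-coded
        canonical["spinel"] = "ruby"
    # dict.get leaves any word without an entry unchanged
    return [canonical.get(token, token) for token in tokens]

# Invented sample tokens
sample_tokens = ["rubies", "spinel", "pearls", "gemstones", "throne"]

merged = Counter(normalize_gems(sample_tokens))
separate = Counter(normalize_gems(sample_tokens, spinel_as_ruby=False))
print(merged["ruby"], separate["ruby"], separate["spinel"])
```

Running the same text through both settings shows how much a single classification decision shifts the final counts.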

Then, I used the table from the dataframe all_fdist as an input into the seaborn bar graph (sns.barplot). This transforms the data into a visual representation of the findings!

#build graph; establish your color, format, and other specifications
    sns.set()
    sns.color_palette("husl", 8)

    fig,ax = plt.subplots(figsize=(8,8))
    all_plot = sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)
    sns.despine()
    plt.xticks(rotation=30)
    plt.ylabel('Frequency (Count)', fontsize=8)
    plt.xlabel('Gem', fontsize=8)
    plt.suptitle('Count mentions of various gemstones in ' + title_of_file, fontsize=14)
    plt.title( 'by ' + author_date, fontsize = 10)

#display the graph
    plt.show()

3. Call the function

The above code (building the gem count list, the dataframe, the chart, etc.) is indented and, thus, forms the instructions built into the defined gem_frequency_graph function. In a new, unindented line, you can call the function. Call the function by typing the function name and entering the arguments in parentheses, in the same order as the parameters.

#example calling the gem frequency graph function (the name must match the def line)
gem_frequency_graph('jahangir_wheeler_thackston_translation.txt', '',
 'Emperor Jahangir - Translated by Wheeler Thackston')

The above arguments produce the following graphic representation:

Research, Test, Refine, and Iterate

For my inputs, I practiced using two translations of the Jahangirnama and Jean-Baptiste Tavernier's The Six Voyages of John Baptista Tavernier. Additional text files could be run through the same code by saving the files in the same environment and updating the arguments. The example function is by no means perfect. As I continue to expand my python skills, I can alter and improve my functions. With every change, I can run the same or new historical texts through it to see the data and end results.
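Running several sources through one function can be as simple as a loop. The sketch below uses a deliberately tiny stand-in analysis, count_keyword, and two invented throwaway files so that it runs anywhere; real use would substitute saved .txt sources and a fuller function.

```python
import os
import tempfile
from collections import Counter

def count_keyword(filename, keyword):
    """Stand-in analysis: count how often a keyword appears in a text file."""
    with open(filename, "rt", encoding="utf-8") as file:
        words = file.read().lower().split()
    return Counter(words)[keyword]

# Throwaway files so the example runs anywhere; real sources would be saved .txt files
workdir = tempfile.mkdtemp()
samples = {"source_a.txt": "pearl pearl ruby", "source_b.txt": "ruby emerald pearl"}
for name, content in samples.items():
    with open(os.path.join(workdir, name), "w", encoding="utf-8") as file:
        file.write(content)

# One function, many sources: only the arguments change between calls
results = {name: count_keyword(os.path.join(workdir, name), "pearl") for name in samples}
print(results)
```

Improving the function once then improves the result for every source in the loop.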

Functions are easily scalable, repeatable, and adaptable. As many programmers or individuals in the technology space know, the clearest path to a better end result is through iteration! After researching or thinking about what you may want to code, try to build it. Iterate on your functions, test them often, improve and tweak as necessary, or start anew if something is not turning out the way you want it to! The possibilities are limitless.

Happy coding!

Bibliography:

Nur al-Din Jahangir (Jahangir Emperor of Hindustan). The Jahangirnama: Memoirs of Jahangir. Emperor of India. Translated, edited, and annotated by Wheeler M. Thackston. New York: Oxford University Press, 1999.

Tavernier, Jean-Baptiste. The Six Voyages of John Baptista Tavernier. London: Printed for R.L. and M.P., 1678.

Establishing and transforming your digital history sources

Natasha — Sun, 10 Oct 2021 03:47:02 GMT

This blog post has three main goals:

Review potential historical sources that can be used as digital inputs for your programs
Establish how to open a digital text input in your code
Explore how to transform and clean up your text inputs (removing stopwords, cleaning up formatting, and other considerations)

This post attempts to provide a simple and digestible overview of what to consider when thinking about which historical sources could be used in your programs. However, it does not cover the basic information regarding installing python3, setting up a virtual environment, installing libraries, or definitions for common terminology (lists, strings, etc.).

Historical Sources as Digital Inputs

To conduct code-based research analysis, you need to find (or create) digital source inputs. The potential inputs you can use are really limitless. Analysis can be performed on text files, data tables, digitally-uploaded images, auditory files, and more modalities.

The following focuses on cleaning up text files. Many historical primary sources are already available online (via various efforts to digitize and archive cultural works such as Project Gutenberg, Fordham University's Internet History Sourcebooks Project, the National Archives, etc.). For written materials, you can find and download sources as txt, csv (comma-separated values), html (HyperText Markup Language), or other formats. I found it easiest to work with the raw text files: they are easy to transform into strings and contain fewer notation additions, so less clean up is required.

If you only have access to a certain format (like html), there are steps you can take to clean up your files and ensure that they can be used as an input. Explore other posts and forums to learn more about the numerous file types or inputs you can use in your analysis.
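As one hedged illustration of such a clean-up step, the sketch below strips tags from an html string using only python's built-in html.parser module. The class name TextOnly, the helper html_to_text, and the sample markup are invented for this example; dedicated libraries offer more robust handling of messy real-world pages.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the text between tags, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind by removed markup
    return " ".join(" ".join(parser.parts).split())

# Invented sample markup
sample_html = "<html><style>p {color: red}</style><body><h1>Memoirs</h1><p>A string of <b>pearls</b> arrived.</p></body></html>"
plain_text = html_to_text(sample_html)
print(plain_text)
```

Joining the pieces with spaces keeps headings and paragraphs from running together, at the cost of occasionally leaving a stray space before punctuation that directly follows an inline tag.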

Opening Digital Input in Code

Download and add your text file to the workspace or virtual environment (venv) you are using and then you can use it in analysis.

For example, I imported the raw text of Wheeler Thackston's translation of the Jahangirnama made available by the Freer Gallery of Art, Arthur M. Sackler Gallery, Smithsonian Institution (doi:10.5479/sil.849796.39088018028456).

At this point, you can manually conduct some very high-level clean up efforts to ensure the text is ready for code-based analysis. This could include actions like manually deleting prefaces or indices at the end of the text you don't want to include in your analysis. After saving this file as jahangir_wheeler_thackston_translation.txt in my virtual environment, I am able to open it in other programs in the same environment.

#opening a file (located in the same environment) in your program 
filename = 'jahangir_wheeler_thackston_translation.txt'
file = open(filename, 'rt')
jahangirnama_text = file.read()
file.close()

Transforming and Cleaning up your Digital Input

Next, use built-in python functionality and the nltk library (ensure you have installed the library first: pip install nltk) to transform your text from a string to a list and to generally clean up your input file. The tokenizer and stopword list also rely on nltk data packages, which you may need to fetch once with nltk.download(). While there are many different approaches and libraries you can use to transform and clean up your input, the nltk library provides powerful, ready-made functions.

#best practice is to import all libraries at the top of your program
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

#transform text from a string to a list (lists are easier for analysis)
#isalpha() keeps only tokens made up entirely of letters
tokens = word_tokenize(jahangirnama_text)
text_turned_into_list = [word for word in tokens if word.isalpha()]

After creating a list, you can transform your input further by making the list all lower case, removing stopwords (a predefined list of commonly used words built into the nltk library), fixing punctuation issues, or addressing other issues that could hinder your analysis.

For example, you don't want a frequency function to tell you that "the" or "a" are the most frequently used words or think that 'Pearl' and 'pearl' are different words. Removing stopwords or capitalization allows your programs to dig more deeply into the meaning and data of the text.

#make the list lowercase to avoid case-sensitivity issues
text_lower = [parts.lower() for parts in text_turned_into_list]

#use the nltk stopword list imported above (from nltk.corpus import stopwords)
stop_words = stopwords.words('english')

#you can also easily extend the stopword list to include other words 
#you think detract from your text analysis
new_words = ['one', 'also', 'two', 'would']
for i in new_words:
    stop_words.append(i)

final_text = [w for w in text_lower if w not in stop_words]
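To see why these two steps matter, the sketch below compares the most frequent word before and after lowercasing and stopword removal. The stopword set and tokens are a tiny invented sample (nltk's English list is far longer), and collections.Counter is used so the snippet runs without nltk installed.

```python
from collections import Counter

# Tiny hand-made stopword set for illustration; nltk's English list is far longer
demo_stopwords = {"the", "a", "of", "and", "was"}

# Invented sample tokens with mixed capitalization
raw_tokens = ["The", "Pearl", "and", "the", "ruby", "of", "the", "pearl", "was", "a", "gift"]

before = Counter(raw_tokens)
lowered = [token.lower() for token in raw_tokens]
after = Counter(token for token in lowered if token not in demo_stopwords)

print(before.most_common(1))
print(after.most_common(1))
```

Before cleaning, a stopword tops the count and 'Pearl' and 'pearl' are tallied separately; afterwards the two spellings merge and a content word takes the lead.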

Additionally, you can stem the words. This may or may not be necessary depending on the analysis you want to perform. Stemming, as the word suggests, cuts the end off words to reduce each word to its root. Several stemmers and lemmatizers are built into the nltk library (PorterStemmer, LancasterStemmer, etc.).

#reduce each word to its stem and collect the results in a new list
ps = PorterStemmer()
stemmed_words = []
for t in final_text:
    stemmed_words.append(ps.stem(t))

Use the print() command to test the results along the way. Given that we usually use primary sources that are hundreds of pages long (the Jahangirnama is around 500 pages), practice using slices to limit what is shown in your terminal.

#print the first one-hundred elements in the list in the terminal
print(final_text[:100])

You can turn all this code into a neat, reusable function. See this post for more details on building functions!

Example File Clean-up Function (no stemming):

Function: file_function(filename)
Purpose: When called, the function opens a file inputted as the parameter filename and performs all necessary clean up tasks (e.g. tokenizes, removes stop words, etc.).
Parameters:
- filename: can be any source file saved in the same environment

def file_function(filename):
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    #open the file, read the text, and close the file automatically
    with open(filename, 'rt') as file:
        text = file.read()
    tokens = word_tokenize(text)
    text_turned_into_list = [word for word in tokens if word.isalpha()]
    text_lower = [parts.lower() for parts in text_turned_into_list]
    stop_words = stopwords.words('english')
    new_words = ['i', 'also', 'much', 'would', 'by', 'another', 'could', 'thou', 'do']
    for i in new_words:
        stop_words.append(i)
    final_text = [w for w in text_lower if w not in stop_words]

    #instead of print(), use a return statement in this function
    return final_text

There are other commonly used functions that you can leverage to clean up your primary source (or other) inputs. Many blog posts share how to clean up strings and lists, though not many have a history focus! The above steps are the ones I followed to create lists from historical texts so that I could perform analysis on textual primary sources. See this post for a more complex discussion regarding functions that augmented my analysis of the gems and pearls in early modern history.

Happy coding!