Establishing and transforming your digital history sources

This blog post has three main goals:

Review potential historical sources that can be used as digital inputs for your programs
Establish how to open text digital input in your code
Explore how to transform and clean up your text inputs (removing stopwords, cleaning up formatting, and other considerations)

This post attempts to provide a simple and digestible overview of what to consider when deciding which historical sources could be used in your programs. It does not cover the basics of installing Python 3, setting up a virtual environment, installing libraries, or definitions of common terminology (lists, strings, etc.).

Historical Sources as Digital Inputs

To conduct code-based research analysis, you need to find (or create) digital source inputs. The potential inputs are nearly limitless: analysis can be performed on text files, data tables, digitized images, audio files, and more.

The following focuses on cleaning up text files. Many historical primary sources are already available online thanks to efforts to digitize and archive cultural works (Project Gutenberg, Fordham University's Internet History Sourcebooks Project, the National Archives, etc.). For written materials, you can find and download sources as txt, csv (comma-separated values), html (HyperText Markup Language), or other formats. I found it easiest to work with raw text files: they are easy to read into strings and contain fewer notation additions, so less clean up is required.

If you only have access to a certain format (like html), there are steps you can take to clean up your files and ensure that they can be used as an input. Explore other posts and forums to learn more about the numerous file types and inputs you can use in your analysis.
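As one illustration of such a step, here is a minimal sketch that pulls the visible text out of an html page using only Python's built-in html.parser module. The class name TextExtractor, the function html_to_text, and the sample markup are my own placeholders, not part of any library, and real pages will usually need more care than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an html page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is not inside a skipped block
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an html string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# hypothetical sample markup, standing in for a downloaded source page
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Memoirs</h1><p>The emperor received <b>pearls</b>.</p></body></html>"
)
plain_text = html_to_text(sample_html)
print(plain_text)
```

Spacing around punctuation can come out slightly off with this approach, but that rarely matters once the text is tokenized and punctuation is dropped.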

Opening Digital Input in Code

Download your text file and add it to the workspace or virtual environment (venv) you are using. Once it is there, you can use it in your analysis.

For example, I imported the raw text of Wheeler Thackston's translation of the Jahangirnama, made available by the Freer Gallery of Art, Arthur M. Sackler Gallery, Smithsonian Institution (doi:10.5479/sil.849796.39088018028456).


At this point, you can manually conduct some very high-level clean up efforts to ensure the text is ready for code-based analysis. This could include actions like manually deleting prefaces or indices at the end of the text you don't want to include in your analysis. After saving this file as jahangir_wheeler_thackston_translation.txt in my virtual environment, I am able to open it in other programs in the same environment.

#opening a file (located in the same environment) in your program
filename = 'jahangir_wheeler_thackston_translation.txt'
#the with statement closes the file automatically, even if reading fails
with open(filename, 'rt') as file:
    jahangirnama_text = file.read()
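If you would rather trim prefaces or indices in code than by hand, the sketch below shows one possible approach: keep only the text between two marker strings. The function name trim_text, the marker strings, and the sample text are all hypothetical; the right markers depend entirely on your edition, and they need to be unique in the file for this to work.

```python
def trim_text(text, start_marker, end_marker):
    """Return the part of text from start_marker up to (not including) end_marker.

    If a marker is not found, that end of the text is left untouched.
    """
    start = text.find(start_marker)
    start = 0 if start == -1 else start
    end = text.find(end_marker, start)
    end = len(text) if end == -1 else end
    return text[start:end].strip()


# hypothetical sample; real marker strings depend on your edition
sample = "Translator's Preface ... CHAPTER ONE The first year of the reign. INDEX Agra, 12"
body = trim_text(sample, "CHAPTER ONE", "INDEX")
print(body)
```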

Transforming and Cleaning up your Digital Input

Next, use built-in Python functionality and the nltk library (ensure you have installed the library first: pip install nltk) to transform your text from a string to a list and to generally clean up your input file. While there are many different approaches and libraries you can use to transform and clean up your input, the nltk library provides powerful, ready-made functions.

#best practice is to import all libraries at the top of your program
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

#transform text from a string to a list (lists are easier for analysis)
tokens = word_tokenize(jahangirnama_text)
#keep only tokens made up entirely of letters (drops punctuation and numbers)
text_turned_into_list = [word for word in tokens if word.isalpha()]
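It is worth knowing exactly what the isalpha() filter throws away. The short sketch below uses a hand-written token list (a stand-in I made up, not actual word_tokenize output) to show that numbers, punctuation, and hyphenated words are all dropped.

```python
# hand-written tokens standing in for tokenizer output (an assumption)
sample_tokens = ["In", "1605", ",", "Jahangir", "wore", "well-known", "pearls", "."]

# str.isalpha() is True only when every character is a letter
kept = [t for t in sample_tokens if t.isalpha()]
dropped = [t for t in sample_tokens if not t.isalpha()]

print(kept)
print(dropped)
```

If dates or hyphenated terms matter to your research question, you may want a looser filter than isalpha().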

After creating a list, you can transform your input further by making the list all lower case, removing stopwords (a predefined list of commonly used words built into the nltk library), fixing punctuation problems, or addressing other issues that could hinder your analysis.

For example, you don't want a frequency function to tell you that "the" or "a" are the most frequently used words, or to treat 'Pearl' and 'pearl' as different words. Removing stopwords and capitalization allows your programs to dig more deeply into the meaning and data of the text.

#make the list lowercase to avoid case-sensitivity issues
text_lower = [parts.lower() for parts in text_turned_into_list]

#load nltk's built-in English stopword list
#(the first time, download it with: import nltk; nltk.download('stopwords'))
stop_words = stopwords.words('english')

#you can also easily extend the stopword list to include other words
#you think detract from your text analysis
new_words = ['one', 'also', 'two', 'would']
stop_words.extend(new_words)

final_text = [w for w in text_lower if w not in stop_words]
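To see why these steps matter, here is a small sketch that counts word frequencies before and after cleaning, using collections.Counter from the standard library. The sample words and the tiny stopword set are made up for illustration and stand in for your real text and nltk's much longer list.

```python
from collections import Counter

# hypothetical sample standing in for a tokenized source
sample_words = ["The", "pearl", "and", "the", "Pearl", "of", "the", "emperor"]
tiny_stop_words = {"the", "and", "of"}  # stand-in for nltk's stopword list

# counting the raw tokens: 'Pearl' and 'pearl' are counted separately
raw_counts = Counter(sample_words)

# counting after lowercasing and removing stopwords
clean_counts = Counter(
    w.lower() for w in sample_words if w.lower() not in tiny_stop_words
)

print(raw_counts.most_common(1))
print(clean_counts.most_common(1))
```

On the raw list the top word is "the"; after cleaning, it is "pearl", with both spellings counted together.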

Additionally, you can stem the words. This may or may not be necessary depending on the analysis you want to perform. Stemming, as the word suggests, cuts off the ends of words to reduce each word to its root. Several stemmers and lemmatizers are built into the nltk library (PorterStemmer, LancasterStemmer, etc.).

#reduce each word to its stem using the Porter stemmer
ps = PorterStemmer()
stemmed_words = [ps.stem(t) for t in final_text]

Use the print() command to test the results along the way. Given that we usually use primary sources that are hundreds of pages long (the Jahangirnama is around 500 pages), practice using indices to limit what is shown in your terminal.

#print the first one-hundred elements in the list in the terminal
print(final_text[:100])
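A few other slices are handy when spot-checking a long list. The sketch below uses a generated placeholder list (not real source text) so that you can see exactly which elements each slice returns.

```python
# placeholder list standing in for a long cleaned text
sample_list = [f"word{i}" for i in range(250)]

print(len(sample_list))       # how many elements the list holds
print(sample_list[:5])        # the first five elements
print(sample_list[-3:])       # the last three elements
print(sample_list[100:105])   # five elements from the middle
```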

You can turn all this code into a neat, reusable function. See this post for more details on building functions.

Example File Clean-up Function (no stemming):

Function: file_function(filename)
Purpose: When called, the function opens the file passed in as the parameter filename and performs all necessary clean up tasks (e.g. tokenizes, lowercases, removes stopwords).
Parameters:
- filename: can be any source file saved in the same environment

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def file_function(filename):
    #open the file and read its contents into a single string
    with open(filename, 'rt') as file:
        text = file.read()
    #tokenize and keep only tokens made up entirely of letters
    tokens = word_tokenize(text)
    text_turned_into_list = [word for word in tokens if word.isalpha()]
    #lowercase every token
    text_lower = [parts.lower() for parts in text_turned_into_list]
    #build the stopword list, extended with custom words
    stop_words = stopwords.words('english')
    new_words = ['i', 'also', 'much', 'would', 'by', 'another', 'could', 'thou', 'do']
    stop_words.extend(new_words)
    final_text = [w for w in text_lower if w not in stop_words]
    #instead of print(), use return so the caller receives the list
    return final_text

There are other commonly used functions that you can leverage to clean up your primary source (or other) inputs. Many blog posts share how to clean up strings and lists, though not many have a history focus! The above steps are the ones I followed to create lists from historical texts that would allow me to perform analysis on textual primary sources. See this post for a more complex discussion of the functions that augmented my analysis of gems and pearls in early modern history.

Happy coding!
