📊 Lab 4: Data Visualization

CS 1004 ~ Prof. Smith

Learning Objectives

You’ve come a long way in just a few weeks! So far, you’ve been gaining comfort with: drawing, animation, calling functions, user input through keyboard and mouse, variables and global scope, conditionals, loops, arrays, and images. Through working on this lab assignment, you should further develop these existing skills, as well as begin to gain comfort with:

Manipulating strings (a data type used to store text)
Reading data from an external file
Creating and modifying objects (sometimes referred to as “dictionaries”)

To complete this lab, you will need to read data from a file. As you work on it, I encourage you to also keep course-wide learning objectives in mind, and in particular reflect upon: who created this data? what does it let you explore? what questions cannot be answered? what assumptions are you making?

Choose the Data

All computer programs manipulate data. Some data is defined in the program (the variables you’ve chosen in previous lab assignments, for example). Much of the time, though, programs operate upon externally provided data. Data comes from many different sources, but all data is artificially constructed. A human (or set of humans) decides what is worth saving and sharing with others. A human formats the data, decides what to include/exclude, and how to organize it.

Your first step in the lab is deciding what data you want to visualize. I recommend choosing a book (or two, or three, or more!) from Project Gutenberg (the example code uses Jane Austen’s Pride and Prejudice). Project Gutenberg provides easy access to text files of tens of thousands of books that are in the public domain. If you wish, you may also choose to collect and use your own data, such as a collection of emails, syllabi for your courses, social media posts, or text messages.

Take some time to think about the broader context of the data you have selected. Who created it? What does it let you explore? What questions can you answer with it, and what cannot be answered? What assumptions are you making about the data?

Question the Data

Take some time to think about what you want to learn from this dataset. For example, in the sample code in this lab, I wanted to look at how often characters are named in the book, and whether that corresponds to a reader’s judgment of character importance/voice. The sample code asks a simple question (appropriate for starter code!): How often is each character named in the book?

Write down your question. It’s okay if the question changes. You still need to start with something.

Clean the Data

Open up your text file. Does it look as you expect it to? The first step in all data visualization projects is a process called “data cleaning”: preparing a data file so that it is in a format your code can parse easily and efficiently.

For the copy of Pride and Prejudice I downloaded, I needed to delete the front matter and the copyright information at the end. This information is very important, but lacking any other easy way to tell my code where the book content begins, I need to get rid of it so that the only text my code will process is the book content itself. There are a lot of other things I could have done, but didn’t bother with for this example, that definitely impact the output, such as:

Removing formatting characters (e.g. Project Gutenberg puts underscores around originally italicized words)
Making sure every line in the document was one and only one sentence (one way to do this is to find-and-replace all “. ” in the document with “.\n”)
Removing chapter header information
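
If you would rather do some of this cleanup in code than by hand, here is a minimal sketch of the first two ideas in plain JavaScript. The helper names removeItalicsMarkers and splitIntoSentences are hypothetical (they are not part of the sample code), and the sample sentence is made up for illustration.

```javascript
// Hypothetical helpers for illustration; they are not part of the lab's sample code.

// Remove the underscores that Project Gutenberg uses to mark italics.
function removeItalicsMarkers(text) {
  return text.replace(/_/g, "");
}

// Put each sentence on its own line by replacing ". " with ".\n".
// This is a rough rule: abbreviations such as "Mr. Darcy" get split too.
function splitIntoSentences(text) {
  return text.split(". ").join(".\n");
}

var sample = "She was _very_ tired. Mr. Darcy left.";
var cleaned = splitIntoSentences(removeItalicsMarkers(sample));
console.log(cleaned);
```

Notice that “Mr.” ends up on its own line: a simple rule like this one changes your data in ways you may not intend, which is part of why cleaning decisions matter.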

If you’re using a scientific dataset, you might need to do things like:

Isolate the specific data you are going to use into a single file, and delete the extraneous data
Convert your data file to a format that’s easy for p5js to parse (typically, delimiting information with newlines and commas, spaces, or tabs)
Delete duplicate entries
Confirm that the dataset follows consistent rules and fix internal inconsistencies (for example, a common issue in COVID testing datasets early in the pandemic was inconsistency in test reporting: some places would log a test by the date it was taken, others by the date the result was returned)
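
As a rough sketch of two of these steps (deleting duplicates and splitting comma-delimited lines into fields), assuming your file has one record per line: the rows below are made-up data, and removeDuplicates and parseRows are hypothetical helper names.

```javascript
// Hypothetical example: the rows below are made-up data, not a real dataset.
var rawLines = [
  "city,date,tests",
  "Springfield,2020-04-01,120",
  "Springfield,2020-04-01,120",
  "Shelbyville,2020-04-01,85"
];

// Drop exact duplicate lines, keeping the first occurrence of each.
function removeDuplicates(rows) {
  var result = [];
  for (var i = 0; i < rows.length; i++) {
    if (!result.includes(rows[i])) {
      result.push(rows[i]);
    }
  }
  return result;
}

// Split every line on commas so each row becomes an array of fields.
function parseRows(rows) {
  var parsed = [];
  for (var i = 0; i < rows.length; i++) {
    parsed.push(rows[i].split(","));
  }
  return parsed;
}

var table = parseRows(removeDuplicates(rawLines));
console.log(table.length); // 3 rows: the header plus two unique entries
console.log(table[2][0]);  // Shelbyville
```

Note that every field is still a string after splitting; a count like "120" would need Number("120") before you can do arithmetic with it.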

What do you need to clean up before you can get started with your project?

If you’re feeling like this is a lot of effort before you even get to start programming, that’s normal: finding an appropriate dataset, figuring out what you can actually do with the dataset you have, and cleaning the data to prepare it for analysis all need to be done before every data visualization project.

Remember: data is artificially constructed, by humans. What you do in the data cleaning step matters.

Parse the Data

Now that you’ve got your text file all set up and ready to go, it’s time to write the code that parses that data into the format your program needs. In the example code, several steps early on build the words array and the frequencies object.

First, make sure you load the data:

function preload() {
  data = loadStrings("pride_and_prejudice.txt");
}

loadStrings(filename) creates an array in which each element corresponds to a single line of the text file.
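
To picture the shape of that array, here is a small stand-in in plain JavaScript. It assumes the file is simply broken at newline characters; loadStrings() itself needs p5.js and a real file, so a two-line string holding the opening sentence of Pride and Prejudice plays the role of the file here.

```javascript
// A stand-in for the file contents; loadStrings() itself needs p5.js and a real file.
var fileText =
  "It is a truth universally acknowledged, that a single man in\n" +
  "possession of a good fortune, must be in want of a wife.";

// Splitting on newlines gives one array element per line of the file,
// which is the same shape as the array described above.
var exampleLines = fileText.split("\n");

console.log(exampleLines.length); // 2
console.log(exampleLines[1]);     // possession of a good fortune, must be in want of a wife.
```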

But that doesn’t get me what I need, which is a list where every element is a single word in the file. So, there’s some more processing to do.

function setup() {
  createCanvas(400, 400);
  words = [];
  frequencies = {};
  max_frequency = -1;
  min_frequency = 10000000;

  //go through every line of the input text file
  for (var i = 0; i < data.length; i++) {
    //split that input file into words (space separator)
    var split = data[i].split(" ");

    //now go through each individual word!
    for (var j = 0; j < split.length; j++) {
      //take all the punctuation off the word
      //store the result in the 'w' variable
      var w = stripPunctuation(split[j]);

      //if the word is a character name we care about
      //add it to the frequencies object
      if (characters.includes(w)) {
        words.push(w);
        //if it's already there, increment
        if (w in frequencies) {
          frequencies[w]++;
        }
        //otherwise, the count starts at 1
        else {
          frequencies[w] = 1;
        }
      }
    }
  }
}

For every entry in that data list, we:

split the string into a list of words separated by a space
strip the punctuation from each word (using stripPunctuation)
push the resulting word to the words list (but only if it is one of the character names)

And then, we increment the count of that word in the frequencies object, which is indexed by the word. That’s a bunch of pre-processing! But now we’ve got data we can actually do something with.
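
The excerpt above uses a characters list and a stripPunctuation function without showing them. Here is one way they could look; this is a sketch under my own assumptions (a short list of names, and “punctuation” meaning anything that is not an unaccented letter), and the definitions in the sample code may differ. The test line is made up for illustration.

```javascript
// Assumed definitions: the sample code may define these differently.
var characters = ["Elizabeth", "Darcy", "Jane", "Bingley"];

// Remove every character that is not a letter A-Z or a-z,
// so "Darcy," and "Darcy." both become "Darcy".
// Limitation: accented letters are removed too, and "Darcy's" becomes "Darcys".
function stripPunctuation(word) {
  return word.replace(/[^A-Za-z]/g, "");
}

// Count the character names in one made-up line, using the same steps as the loop above.
var frequencies = {};
var line = "\"Darcy,\" said Elizabeth. Darcy bowed.";
var split = line.split(" ");
for (var j = 0; j < split.length; j++) {
  var w = stripPunctuation(split[j]);
  if (characters.includes(w)) {
    if (w in frequencies) {
      frequencies[w]++;
    } else {
      frequencies[w] = 1;
    }
  }
}

console.log(frequencies); // { Darcy: 2, Elizabeth: 1 }
```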

Analyze and Visualize the Data

Now it’s time for you to write the code that aims to answer the question you wrote down in “Question the Data”. You can split this into two phases:

analysis: creating additional variables, lists, and structures that either transform the data into a new format, or store information you derive from the original data
visualization: using the variables you’ve created during the analysis step to create graphical elements that correspond to the data and answer the question(s) you have about the data

As a general rule, you should put all of your analysis code in the setup() function and all of the visualization code in the draw() function. That’s because you only need to do the analysis step once! For the analysis phase, take a look at the array and string methods in the p5.js reference, which are the most common functions you are likely to need. Remember that strings are case sensitive!
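
As a small sketch of an analysis step, the loop below finds the largest and smallest counts in a frequencies object. The counts are made up for illustration, and this is only one possible way to use the max_frequency and min_frequency variables, not necessarily what the sample code does. The last two lines show why case sensitivity matters.

```javascript
// Made-up counts for illustration; in a sketch these would come from the parsing step.
var frequencies = { Elizabeth: 3, Darcy: 5, Jane: 2 };
var max_frequency = -1;
var min_frequency = 10000000;

// Visit every key in the object and keep track of the largest and smallest counts.
for (var charName in frequencies) {
  if (frequencies[charName] > max_frequency) {
    max_frequency = frequencies[charName];
  }
  if (frequencies[charName] < min_frequency) {
    min_frequency = frequencies[charName];
  }
}

console.log(max_frequency, min_frequency); // 5 2

// Strings are case sensitive, so compare lowercased copies when case should not matter.
console.log("Darcy" === "darcy");                             // false
console.log("Darcy".toLowerCase() === "darcy".toLowerCase()); // true
```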

For the visualization phase, think about how the data you have can map to drawing primitives. In general, you will be mapping data onto one of the following attributes of your drawing: fill color, stroke color, position, weight, or number of shapes. You can also think about adding in user interactivity; for example, clicking on symbols to see more about the underlying data.
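
Here is one hedged sketch of such a mapping: each count becomes the height of a bar, and each bar’s position comes from its order in the object. All of the names (rescale, buildBars, barAt) are hypothetical and the counts are made up. The functions are plain JavaScript so they run on their own; rescale does the same job as p5.js’s map().

```javascript
// Made-up counts for illustration.
var frequencies = { Elizabeth: 3, Darcy: 5, Jane: 2 };

// Linearly rescale a value from one range to another.
// p5.js provides map() for this; it is written out here so the example runs on its own.
function rescale(value, inMin, inMax, outMin, outMax) {
  return outMin + (value - inMin) * (outMax - outMin) / (inMax - inMin);
}

// Build one bar per character: position comes from the order,
// height comes from the count (0 to maxCount maps to 0 to canvasHeight).
function buildBars(counts, maxCount, canvasWidth, canvasHeight) {
  var names = Object.keys(counts);
  var barWidth = canvasWidth / names.length;
  var bars = [];
  for (var i = 0; i < names.length; i++) {
    var h = rescale(counts[names[i]], 0, maxCount, 0, canvasHeight);
    bars.push({ name: names[i], x: i * barWidth, y: canvasHeight - h, w: barWidth, h: h });
  }
  return bars;
}

// Return the bar that contains the point (px, py), or null if there is none.
// In a sketch, this could be called from mousePressed() with mouseX and mouseY.
function barAt(bars, px, py) {
  for (var i = 0; i < bars.length; i++) {
    var b = bars[i];
    if (px >= b.x && px < b.x + b.w && py >= b.y && py <= b.y + b.h) {
      return b;
    }
  }
  return null;
}

var bars = buildBars(frequencies, 5, 400, 400);
console.log(bars[1].h);                 // 400, because Darcy has the largest count
console.log(barAt(bars, 10, 390).name); // Elizabeth
```

In draw(), you could loop over bars and call rect(b.x, b.y, b.w, b.h) for each one. Building the bars once in setup() and only drawing them in draw() fits the analysis/visualization split described above.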

It is, as always, completely acceptable to modify the sample code provided to meet your needs.

Reflect on What You Learned

Take some time to look back at the question you set at the beginning of this lab. Was it easy or hard to answer? While programming, did you find any new assumptions you had made without realizing? How, if at all, did the context surrounding the data you selected inform your visualization? How do you think it could/should have informed it?

Then, write an additional paragraph describing what you think you’ve learned from this assignment, both in relation to the learning objectives described at the beginning of the lab assignment, and more broadly related to the course goals. What do you know now that you didn’t know before starting the lab? If you were to start this lab again knowing what you know now, would you do anything differently? What did you find difficult and/or easy about this lab? Is there more you want to learn?

Turn It In

Save your final p5js sketch with the naming convention: lastname_firstname_lab4

Then, submit the following:

A zip file of the directory containing your sketch for your lab assignment (remember to zip the full directory, not just the directory contents) – this will include the data file! If you have used personal data (e.g. your own email archive), you may remove it before submission, but make sure you mention that you have done so in your reflection.
A link to the working sketch, if you used the p5js editor
Either:
- screenshots or video demonstrating the output of your lab (if the code works) OR
- a brief description of what you think is wrong with your code (if the code doesn’t work)
Your reflection on creating a data visualization and what you have learned this week

Supplemental Files

Lab 4 sample code is available online or to download:

cs_1004_lab_4_sample_complete.zip (942KB)