CS 4501 / 6501 Analyzing Online Behavior for Public Health
Prof. Henry Kautz <henry.kautz@virginia.edu>
Monday & Wednesday 3:30-4:45pm, Mechanical Engr Bldg 341
People’s online behavior contains signals about their physical and mental health. This course will explore research on using data from users’ interactions with Twitter/X, Google Search, YouTube and other online platforms for tasks ranging from identifying people suffering from anxiety disorder to tracking down restaurants that are sources of food poisoning. We will also read papers on both sides of the ongoing debate about whether social media should be restricted because of potential harm to children or adults.
Please visit this Google sheet for the calendar of readings and presentations.
CS 2100 or permission of the instructor.
Students will be required to read up to 4 research papers each week and write a one-page summary of each. These summaries must be written by the students themselves, without using any AI writing or summarization tools. They are due on the day the paper is discussed in class.
In addition, pairs of students will be assigned to present the papers and lead class discussions. Each pair should work together to create a single PowerPoint (or other presentation software) presentation about 25 minutes long. They should strive to thoroughly understand the paper, which may involve reading other papers cited by the paper being presented or learning about the machine learning algorithms the authors used so they can explain them to the class. Students should contact each other at least a week before their presentation to make sure they have time to meet. They should practice the presentation before giving it to the class.
Presenters are responsible for checking the class calendar to know the date they have been assigned. A student who will not be able to attend class on their assigned date should inform the instructor as soon as possible so that they can be assigned to a different date. In addition, if a student decides to drop the class before their presentation, they should inform the instructor as soon as that decision is made so that he can revise the calendar.
The written summaries and presentations should include:
The research hypothesis of the work
The methods employed
A critical discussion of the results, including strengths and weaknesses
Ethical concerns if any
Two questions about the work that you would be prepared to ask during the discussion of the paper.
Programming Projects
For the course programming projects, students will download and analyze part of their own online footprint. The easiest data users can access is their Google Takeout data, which includes their browsing history, location history, search history, and YouTube viewing history.
Students who already use Google services should consider turning on the “Timeline” feature of Google Maps immediately upon registering for the course (if it is not already on) in order to ensure that they have a rich set of location data to analyze. Students who do not use Google services should obtain some other kind of mapping software that will allow them to download their movement history.
You are encouraged to make use of large language models such as ChatGPT for help coding. You are also free to make use of public code repositories. All such use of LLMs or public repositories must be cited in your report.
Both CS 4501 and CS 6501 students will complete the first two projects:
Use your Google Takeout data to infer your sleep patterns for at least a month, under the assumption that these can be determined from your use of Google applications. Create visualizations of the data. Write a report of at least 1500 words discussing your methods, results, and insights gained. Be prepared to give a 2 minute "lightning talk" (one slide) about your project.
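One simple way to start on the sleep project is to treat each long gap between consecutive activity records as a candidate sleep period. The sketch below assumes the timestamps have already been extracted from the Takeout export into Python `datetime` objects; it does not parse any Takeout file. The function name `infer_sleep_periods`, the 4-hour threshold, and the sample data are illustrative assumptions, not part of the assignment.

```python
from datetime import datetime, timedelta


def infer_sleep_periods(timestamps, min_gap=timedelta(hours=4)):
    """Return (last_activity, next_activity) pairs for every inactivity
    gap of at least min_gap. The threshold is an assumption to tune."""
    ordered = sorted(timestamps)
    periods = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier >= min_gap:
            periods.append((earlier, later))
    return periods


# Hypothetical activity times (searches, page visits, video views).
activity = [
    datetime(2024, 3, 1, 22, 40),
    datetime(2024, 3, 1, 23, 15),
    datetime(2024, 3, 2, 7, 5),
    datetime(2024, 3, 2, 9, 30),
    datetime(2024, 3, 2, 12, 30),
    datetime(2024, 3, 2, 15, 45),
    datetime(2024, 3, 2, 19, 0),
    datetime(2024, 3, 2, 22, 10),
    datetime(2024, 3, 2, 23, 50),
    datetime(2024, 3, 3, 8, 10),
]

sleep = infer_sleep_periods(activity)
for start, end in sleep:
    hours = (end - start).total_seconds() / 3600
    print(f"{start} -> {end} ({hours:.1f} h)")
```

A gap threshold alone will also flag long daytime stretches away from your devices, so a real analysis would likely add a rule about the time of day and discuss such false positives in the report.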
Use your Google Takeout Map Timeline data (or similar) to infer the significant locations in your life, that is, places you repeatedly visit over the course of a month, including your home, your classrooms, favorite restaurants, and similar. Your program will need to distinguish between quickly moving through a place and spending some amount of time at a significant location. Note that due to variations in GPS readings, you will need to implement a method to cluster nearby readings. Implement an algorithm that automatically labels the types of significant locations based on features of the locations that you can obtain using the Google Places API, and include in your report how accurate your algorithm is. Create visualizations of the data. Write a report of at least 1500 words discussing your methods, results, and insights gained. Be prepared to give a 2 minute "lightning talk" (one slide) about your project.
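For the significant-locations project, the two requirements above (separating passing through from staying, and clustering noisy GPS readings) can be sketched with a simple stay-point approach. This is a minimal illustration under stated assumptions, not the required method: readings are assumed to be time-ordered `(timestamp, lat, lon)` tuples already extracted from your data, the 100 m radius and 15 minute dwell time are arbitrary starting values, and the names `find_stays` and `group_stays` and the coordinates are made up. It does not call the Google Places API.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_stays(readings, radius_m=100, min_duration=timedelta(minutes=15)):
    """Group consecutive readings that stay within radius_m of the first
    reading in the group; keep groups lasting at least min_duration.
    Returns (start, end, mean_lat, mean_lon) tuples."""
    stays = []
    i = 0
    while i < len(readings):
        j = i + 1
        while j < len(readings) and haversine_m(
            readings[i][1], readings[i][2], readings[j][1], readings[j][2]
        ) <= radius_m:
            j += 1
        group = readings[i:j]
        if group[-1][0] - group[0][0] >= min_duration:
            lat = sum(r[1] for r in group) / len(group)
            lon = sum(r[2] for r in group) / len(group)
            stays.append((group[0][0], group[-1][0], lat, lon))
        i = j
    return stays


def group_stays(stays, radius_m=100):
    """Greedily merge stays into places: a stay joins the first place
    whose center is within radius_m, otherwise it starts a new place."""
    places = []
    for start, end, lat, lon in stays:
        for place in places:
            if haversine_m(place["lat"], place["lon"], lat, lon) <= radius_m:
                place["visits"].append((start, end))
                break
        else:
            places.append({"lat": lat, "lon": lon, "visits": [(start, end)]})
    return places


# Hypothetical readings: home, a point in transit, campus, transit, home.
t0 = datetime(2024, 3, 4, 8, 0)


def at(minutes, lat, lon):
    return (t0 + timedelta(minutes=minutes), lat, lon)


readings = [
    at(0, 38.0300, -78.5000), at(10, 38.0301, -78.5001), at(30, 38.0299, -78.4999),
    at(40, 38.0320, -78.5030),
    at(50, 38.0336, -78.5080), at(70, 38.0337, -78.5081), at(100, 38.0335, -78.5079),
    at(110, 38.0318, -78.5040),
    at(120, 38.0300, -78.5000), at(150, 38.03005, -78.50005),
]

stays = find_stays(readings)
places = group_stays(stays)
for place in places:
    print(round(place["lat"], 4), round(place["lon"], 4), len(place["visits"]), "visit(s)")
```

The greedy merge depends on the order of the stays and never updates a place's center; a density-based method such as DBSCAN is a common alternative worth comparing in your report.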
Only CS 6501 students are required to complete a third project; it is optional for CS 4501 students. Students may work on the project alone or in partnership with another student. If you decide to partner with another student, please be sure that you are comfortable sharing your data with that person. You are by no means compelled to share your data with anyone else in order to complete this course! Teams of three students might also be possible, but you should obtain permission from the instructor in advance, and the scale and depth of your project should be larger than the minimum requirements.
Create a research hypothesis about the relationship between at least two different kinds of personal online data and events in your life. The online data can be all from Google Takeout or can include data from other sources such as Instagram. For events in your life, manually create a timeline in JSON for a three-month period that includes events such as the beginning of classes, exams, travel, illnesses, and other significant events during that period. Design and carry out an experiment to test your hypothesis and analyze the results. The analysis can take the form of statistical correlations or a machine learning predictive model (or both).
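The assignment does not prescribe a schema for the JSON timeline, so the layout below is only one possibility: a list of objects with hypothetical `date`, `type`, and `note` fields. The events and dates are invented for illustration. The sketch shows how such a file could be loaded and filtered in Python.

```python
import json
from datetime import date

# Hypothetical timeline; field names and events are assumptions.
timeline_json = """
[
  {"date": "2024-01-17", "type": "classes_begin", "note": "Semester starts"},
  {"date": "2024-02-12", "type": "illness", "note": "Flu, three days"},
  {"date": "2024-03-01", "type": "exam", "note": "Midterm"},
  {"date": "2024-03-04", "type": "travel", "note": "Trip out of town"}
]
"""


def load_timeline(text):
    """Parse the JSON text and convert ISO date strings to date objects."""
    events = json.loads(text)
    for event in events:
        event["date"] = date.fromisoformat(event["date"])
    return sorted(events, key=lambda e: e["date"])


def events_between(events, start, end):
    """Return the events whose date falls in [start, end], inclusive."""
    return [e for e in events if start <= e["date"] <= end]


events = load_timeline(timeline_json)
february = events_between(events, date(2024, 2, 1), date(2024, 2, 29))
print([e["type"] for e in february])
```

Keeping one event per object with an ISO-format date makes it straightforward to align the timeline with daily counts computed from your other data sources.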
To understand how to compute correlations between your data streams, ask Dr. ChatGPT, "How do I compute time series correlations for discrete events?"
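As one common starting point (not necessarily the answer an LLM would give), discrete events can be binned into daily counts and the two count series compared with a Pearson correlation, optionally with a lag so that one series is allowed to follow the other by a few days. The helper names and the sample counts below are hypothetical.

```python
import math
from collections import Counter
from datetime import date, timedelta


def daily_counts(event_dates, start, days):
    """Count events per day for `days` consecutive days from `start`."""
    counts = Counter(event_dates)
    return [counts[start + timedelta(days=i)] for i in range(days)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (nan if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def lagged_correlation(xs, ys, lag):
    """Correlate xs[t] with ys[t + lag] for a non-negative lag in days."""
    if lag:
        xs, ys = xs[:-lag], ys[lag:]
    return pearson(xs, ys)


# Hypothetical daily counts: stream B echoes stream A one day later.
stream_a = [0, 3, 0, 0, 2, 0, 0, 4, 0, 0]
stream_b = [0, 0, 3, 0, 0, 2, 0, 0, 4, 0]

for lag in range(3):
    print(lag, round(lagged_correlation(stream_a, stream_b, lag), 3))
```

With only a few months of data, a correlation found after trying many lags can easily be a coincidence, so report how many lags you tested.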
A machine learning predictive model would use data from one or more sources to predict events in another source. You can use a Python library such as scikit-learn to create the model and compute its accuracy. If you are working on the project by yourself, you will need to separate your data into a training set and a testing set. If you are working in a team, you can try training on one person's complete data and then using the model to make predictions on the other person's data. Use cross validation to make your results more reliable.
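To make the idea of cross validation concrete, here is a minimal k-fold sketch written with only the Python standard library. In practice scikit-learn's `KFold` and `cross_val_score` (in `sklearn.model_selection`) do the same job; the hand-written version is shown only so every step is visible. The function names, the majority-class baseline, and the sample labels are illustrative assumptions.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds of n items."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def fit_majority(train_X, train_y):
    """Baseline 'model': always predict the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda features: most_common


def cross_validate(X, y, fit, k=5, seed=0):
    """Return the test accuracy on each of the k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(model(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return scores


# Hypothetical data: 20 days, label 1 marks a day with an event.
X = list(range(20))          # placeholder features
y = [0] * 14 + [1] * 6

scores = cross_validate(X, y, fit_majority, k=5)
print(scores, sum(scores) / len(scores))
```

Any model you build should beat this majority-class baseline. Because your data are a time series, shuffled folds can let information from the future leak into training; splitting by contiguous blocks of days is a more cautious design.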
A challenge for using machine learning for this project is the relatively small amount of data you will have for yourself. You might want to find a way to leverage public databases of social media data to create a larger training set (where you could give higher weight to your own data by repeating it several times). One collection of social media posts is the Stanford Large Network Dataset Collection (SNAP). Another approach may be to use a pretrained large language model such as ChatGPT to do what is called "zero-shot" learning.
Write a report of at least 2500 words discussing your methods, results, and insights gained. Be prepared to give a 5 to 10 minute talk about your final project.
Using AI tools to directly create the paper summaries will be treated as academic dishonesty. You may, however, use an AI such as ChatGPT to help explain concepts from the paper you don't understand, but in such cases include a note in your summary saying how the AI helped you. You may use code in your projects that you find in public repositories, but such use must be cited in your report. All cases of suspected academic dishonesty will be referred to the UVA Honor Office.
Grades will be based on:
25% written paper summaries. These should be turned in at the class session in which the paper is being discussed. Hardcopy is preferred. If you cannot attend in person, you may use the Canvas turn-in feature to submit the summary. Summaries may not be turned in late. Students are forgiven up to two missing summaries during the semester.
25% for your paper presentation.
50% on programming projects and reports. Turning in an assignment late will result in a 25% penalty.