CS 4501 / 6501 Analyzing Online Behavior for Public Health
Monday & Wednesday 3:30-4:45pm
Mechanical Engr Bldg 341

Prof. Henry Kautz <henry.kautz@virginia.edu>
Office Hour: Monday 2-3pm Rice 511

TA Sherharyar Khalid
Office Hour: Wednesday 1-2pm Rice 422

TA Wenqian Ye
Office Hour: Tuesday 4-5pm via Zoom (https://virginia.zoom.us/j/5888648735)

Description

People’s online behavior contains signals about their physical and mental health. This course will explore research on using data from users’ interactions with Twitter/X, Google Search, YouTube and other online platforms for tasks ranging from identifying people suffering from anxiety disorder to tracking down restaurants that are sources of food poisoning. We will also read papers on both sides of the ongoing debate about whether social media should be restricted because of potential harm to children or adults.

Course Calendar

Please visit this Google sheet for the calendar of readings and presentations.

Prerequisites

CS 2100 or permission of the instructor.

Readings and Presentations

Students will be required to read up to 4 research papers each week and write a one-page summary of each. These summaries should be written manually by the students without using any AI writing or summarization tools. They are due on the day the paper is discussed in class.

In addition, pairs of students will be assigned to present the papers and lead class discussions. Each pair should work together to create a single PowerPoint (or other presentation software) presentation about 25 minutes long. They should strive to thoroughly understand the paper, which may involve reading other papers cited by the paper being presented or learning about the machine learning algorithms the authors used so they can explain them to the class. Students should contact each other at least a week before their presentation to make sure they have time to meet. They should practice the presentation before giving it to the class.

Presenters are responsible for checking the class calendar to know the date they have been assigned. A student who will not be able to attend class on their assigned date should inform the instructor as soon as possible so that they can be assigned to a different date. In addition, if a student decides to drop the class before their presentation, they should inform the instructor as soon as that decision is made so that he can revise the calendar.

The written summaries and presentations should include

The research hypothesis of the work
The methods employed
A critical discussion of the results, including strengths and weaknesses
Ethical concerns if any
Two questions about the work that you would be prepared to ask during the discussion of the paper.

Programming Projects

For the course programming projects, students will download and analyze part of their own online footprint. The easiest data users can access is their Google Takeout data, which includes their browsing history, search history, and YouTube viewing history.

You will need to use Chrome as the default browser on your phone. Please install it if necessary. Sign into it with the Google account that you will use for the projects. When the time comes to download your data, launch Chrome on your computer and sign into it with the same account. You can then use Google Takeout to access the data gathered on your phone.

You will need to activate Google Maps Timeline on your iPhone or Android phone as soon as you can so that the phone starts collecting the data you will need for the second project. (Please see details of Project 2 below if you do not want to use Google Maps, because you will need to install some other tracking app and start using it immediately.) The following instructions are for the latest versions of iOS and Android. If your phone runs an older version of its operating system, you may need to adapt the instructions to make them work for you.

To activate Timeline on an iPhone or Android phone:

Open Google Maps.
Tap your icon on the top right.
Tap Settings.
Scroll down and tap on Personal Content.
On iPhone, the middle of this page should say "Location Services is On"; on Android, under Location settings, it should say "Location is On". In either case, if it says something else, such as "Location Services is not set to Always" or "Location is off", tap those words. This will take you to the Location Services subscreen for Google Maps in the phone's main Settings app. Select Always. Tap the return arrow at the top left of the screen. You should be back in the Personal Content screen with the message "Location Services is On" displayed.
Tap Timeline settings.
Turn on Timeline (or check that it is already on).
Optionally, turn on Backup. This backup is only useful if you change phones; you cannot access the backed-up data yourself.
Check the Auto-delete Timeline setting and make sure it’s set to NOT auto-delete.

You are encouraged to make use of large language models such as ChatGPT for help coding. You are also free to make use of public code repositories. All such use of LLMs or public repositories must be cited in your report.

Both CS 4501 and CS 6501 students will complete the first two projects. Students may work on a project alone or in partnership with another student. Groups can include three people with permission of the instructor. Group projects are expected to include somewhat more work than a single-person project; for example, a group project could compare the data between students as well as looking at the aggregate data from everyone. If you decide to be part of a group, please be sure that you are comfortable sharing your data with the others in the group. You are by no means compelled to share your data with anyone else in order to complete this course.

Project 1

Use your Google Takeout data to infer your sleep patterns for at least a month, under the assumption that these can be determined from your use of Google applications. Create visualizations of the data using software such as Excel or scientific graphing programs. Write a report of at least 1500 words discussing your data sources, hypotheses (if any), methods, results, and insights gained. Be prepared to give a 2 minute "lightning talk" (one slide) about your project.
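One simple starting point, offered here only as a sketch and not as the required method, is to treat any long gap between consecutive online events as a candidate sleep period. The function name `infer_sleep_windows` and the four-hour threshold are illustrative assumptions; a long gap could just as easily be a class or a hike, so treat the output as a hypothesis to check.

```python
from datetime import datetime, timedelta

def infer_sleep_windows(events, min_gap=timedelta(hours=4)):
    """Return (last_activity, next_activity) pairs for long gaps in activity.

    events: iterable of datetime objects, one per page visit or video view.
    min_gap: assumed minimum length of a sleep period (illustrative value).
    """
    ordered = sorted(events)
    windows = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev >= min_gap:
            windows.append((prev, curr))
    return windows

# Made-up sample events for illustration only.
events = [
    datetime(2025, 2, 3, 23, 40),
    datetime(2025, 2, 3, 23, 55),
    datetime(2025, 2, 4, 7, 30),
    datetime(2025, 2, 4, 9, 0),
]
for asleep, awake in infer_sleep_windows(events):
    print("possible sleep:", asleep, "to", awake, "lasting", awake - asleep)
```

With real data you would probably also want to restrict candidate gaps to plausible hours of the night and keep only the longest gap per day.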

Your project will be strengthened if you can measure how accurately the data allows you to measure your sleep schedule. Consider keeping a written record of when you go to bed and when you wake up for a few weeks. If you use a sleep-tracking smartwatch or activity tracker, use its data as ground truth.
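As one possible way to quantify accuracy, the sketch below computes the mean absolute error, in minutes, between inferred times and the times in your written record or tracker. The function name and the sample values are hypothetical, and other error measures would be equally reasonable.

```python
from datetime import datetime

def mean_abs_error_minutes(inferred, recorded):
    """Average absolute difference, in minutes, between paired datetimes."""
    if len(inferred) != len(recorded) or not inferred:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [abs((a - b).total_seconds()) / 60 for a, b in zip(inferred, recorded)]
    return sum(diffs) / len(diffs)

# Made-up bedtimes: what the analysis inferred vs. what the diary says.
inferred_bedtimes = [datetime(2025, 2, 3, 23, 55), datetime(2025, 2, 5, 0, 20)]
diary_bedtimes = [datetime(2025, 2, 4, 0, 10), datetime(2025, 2, 5, 0, 5)]
print(mean_abs_error_minutes(inferred_bedtimes, diary_bedtimes))
```

Reporting bedtime error and wake-time error separately may be more informative than a single combined number.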

The Google Takeout data most relevant to this project is your Chrome browsing history, which contains your search history as well, and your YouTube history. On the Google Takeout page, first deselect all categories, and then select Chrome and "YouTube and YouTube Music". You can select whether to receive an email with a link when the data is ready to download or to have it automatically added to your Google Drive. If you stay on the Takeout page, you will also be able to access the data under "Your latest export". You will need to figure out (with help from ChatGPT!) how to read the JSON files, how to convert the timestamps into clock time and dates, and (if you choose to do so) how to distinguish searches from other web pages browsed.
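To give a flavor of the timestamp and search-detection steps, here is a minimal sketch. It assumes the Chrome export is a JSON object with a "Browser History" list whose entries carry a `url` and a `time_usec` field (microseconds since the Unix epoch); these field names are assumptions, so inspect your own file before relying on them. The helper names are hypothetical.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

def usec_to_datetime(time_usec, tz=timezone.utc):
    """Convert microseconds since the Unix epoch to a timezone-aware datetime."""
    return datetime.fromtimestamp(int(time_usec) / 1_000_000, tz=tz)

def search_query(url):
    """Return the query text if url looks like a Google search, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "google.com" and parsed.path == "/search":
        terms = parse_qs(parsed.query).get("q")
        return terms[0] if terms else None
    return None

# Inline sample standing in for the exported file (assumed layout, made-up entries).
sample = json.loads("""
{"Browser History": [
  {"title": "sleep hygiene - Google Search",
   "url": "https://www.google.com/search?q=sleep+hygiene",
   "time_usec": 1738627200000000},
  {"title": "UVA",
   "url": "https://www.virginia.edu/",
   "time_usec": 1738630800000000}
]}
""")

for entry in sample["Browser History"]:
    when = usec_to_datetime(entry["time_usec"])
    print(when.isoformat(), "| search:", search_query(entry["url"]))
```

The default here is UTC; for sleep analysis you would pass your local time zone as `tz`, for example one built with the standard-library `zoneinfo` module.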

Unfortunately, Google recently removed the ability to specify a time range for Takeout, so you will get your complete history. Note that it can take hours or days for Google to create the takeout zip file(s), so be sure to get started on this as soon as you can.

If you do not use Google services, you will need to find an alternative source of data. For example, if you use Safari on an iPhone, you could download your browsing history and use the timestamps of visited pages. You can download the data directly on the iPhone, or if you sync Safari with iCloud, from Safari on your MacBook.

Place the code for your project in a public GitHub repository, and include the URL for the repository in your report.

Upload the one-slide summary by Sunday February 23 at 11:59pm. Use the names of all the people in your group as the filename. The slide must be in Microsoft PowerPoint format. If you use different presentation software, export it to pptx format or export as an image and paste into a PowerPoint slide. Only one member of your group need upload it.

Upload the report by Monday February 24 at 11:59pm. This is the day for the lightning presentations. Use the names of all the people in your group as the filename. The report should be PDF (preferred) or Microsoft Word.

Project 2

Use your Google Maps Timeline data to infer the significant locations in your life, that is, places you repeatedly visit over the course of a month, including your home, your classrooms, favorite restaurants, and similar. Your program will need to distinguish between quickly moving through a place and spending some amount of time at a significant location. Note that due to variations in GPS readings, you will need to implement a method to cluster nearby readings. Implement an algorithm that automatically labels the types of significant locations based on features of the locations that you can obtain using the Google Places API, and include in your report how accurate your algorithm is. Create visualizations of the data. Write a report of at least 1500 words discussing your data sources, hypotheses (if any), methods, results, and insights gained. Be prepared to give a one minute "lightning talk" (one slide) about your project.
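As a rough illustration of separating "passing through" from "staying", the sketch below groups consecutive location fixes that remain within a small radius of the first fix in the group and keeps only groups that last long enough. It is one simple heuristic among many, not the required method. It assumes you have already converted your export into `(datetime, lat, lon)` tuples, and the 100 m radius and 10 minute threshold are arbitrary illustrative values.

```python
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6_371_000  # mean Earth radius in meters
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def stay_points(fixes, radius_m=100, min_stay=timedelta(minutes=10)):
    """Find stays in a time-sorted list of (datetime, lat, lon) fixes.

    Consecutive fixes within radius_m of the group's first fix form a group;
    groups lasting at least min_stay are returned as
    (mean_lat, mean_lon, arrival, departure).
    """
    stays = []
    i = 0
    while i < len(fixes):
        j = i + 1
        while j < len(fixes) and haversine_m(
            fixes[i][1], fixes[i][2], fixes[j][1], fixes[j][2]
        ) <= radius_m:
            j += 1
        group = fixes[i:j]
        if group[-1][0] - group[0][0] >= min_stay:
            lat = sum(f[1] for f in group) / len(group)
            lon = sum(f[2] for f in group) / len(group)
            stays.append((lat, lon, group[0][0], group[-1][0]))
        i = j
    return stays

# Made-up fixes: about twenty minutes in one spot, then moving away.
t0 = datetime(2025, 3, 3, 9, 0)
fixes = [
    (t0, 38.0316, -78.5108),
    (t0 + timedelta(minutes=5), 38.0317, -78.5107),
    (t0 + timedelta(minutes=20), 38.0316, -78.5109),
    (t0 + timedelta(minutes=25), 38.0400, -78.5000),
    (t0 + timedelta(minutes=27), 38.0500, -78.4900),
]
for stay in stay_points(fixes):
    print(stay)
```

Stays found on different days at the same place still need to be merged into one significant location, for example by clustering the stay centers.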

If you do not use Google Maps, you will need to find and install an alternate program on your phone that continuously records your location and which can export the data. There are many tracking apps on the Google and Apple app stores, but you are responsible for finding one that works, paying for any subscription cost, and gathering at least a month of data. You cannot depend on Apple Maps for gathering the data you will need.

Google Maps Timeline is no longer available through Google Takeout, but must be downloaded directly from your phone.

To download Timeline data from an iPhone or Android phone:

Open the Google Maps app on your phone.
Tap on your icon at the top right.
Tap on Settings.
Tap on Personal content.
Tap on Export Timeline data.
Select the destination for the file. I suggest Google Drive, which will appear if you have installed the Google Drive app on your phone and signed into it.
The resulting file will be named location-history.json.

There is a free tier for Google Places that allows 1,000 queries per 24 hours. An alternative to Google Places is Foursquare. When you sign up for a free developer account you get $200 in credits. This should be more than enough for you to complete your project - just be careful not to immediately run thousands of GPS locations through it before you have debugged your code and burn through your credit.
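One way to protect your quota while debugging is to wrap whatever function calls the places service in a cache with a hard cap on real queries. The sketch below is a hedged illustration: `make_cached_lookup` is a hypothetical helper, `fake_lookup` stands in for your actual API call, and rounding coordinates to four decimal places (roughly 11 m of latitude) is an assumed way to let nearby readings share one query.

```python
def make_cached_lookup(lookup, max_calls=1000, precision=4):
    """Wrap a place-lookup function with a cache and a cap on real calls.

    lookup: function (lat, lon) -> place info; in practice, your API call.
    max_calls: maximum number of uncached lookups allowed.
    precision: decimal places used to round coordinates into cache keys.
    """
    cache = {}
    calls = 0

    def cached(lat, lon):
        nonlocal calls
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            if calls >= max_calls:
                raise RuntimeError("query budget exhausted")
            calls += 1
            cache[key] = lookup(*key)
        return cache[key]

    cached.calls_made = lambda: calls
    return cached

def fake_lookup(lat, lon):
    """Stand-in for a real places query; returns made-up data."""
    return {"type": "cafe", "lat": lat, "lon": lon}

find_place = make_cached_lookup(fake_lookup, max_calls=2)
find_place(38.03161, -78.51082)
find_place(38.03162, -78.51081)  # rounds to the same key, so no new query
print("real queries so far:", find_place.calls_made())
```

Testing against a stand-in like `fake_lookup` first lets you debug the rest of the pipeline without spending any queries.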

Place the code for your project in a public GitHub repository, and include the URL for the repository in your report.

Upload the one-slide summary by Sunday March 30 at 11:59pm. Use the names of all the people in your group as the filename. Include your names as well on the slide itself. The slide must be in Microsoft PowerPoint format. If you use different presentation software, export it to pptx format or export as an image and paste into a PowerPoint slide. Only one member of your group need upload it.

Upload the report by Monday March 31 at 11:59pm. This is the day for the lightning presentations. Use the names of all the people in your group as the filename. Include your names on the first page of the report. The report should be PDF (preferred) or Microsoft Word.

Project 3

Only CS 6501 students are required to complete a third project; it is optional for extra credit for CS 4501 students.

Create a research hypothesis about the relationship between at least two different kinds of personal online data and events in your life. The online data can be all from Google Takeout or can include data from other sources such as Instagram. For events in your life, manually create a timeline in JSON for a three month period that includes events such as the beginning of classes, exams, travel, illnesses, and other significant events during that period. Design and carry out an experiment to test your hypothesis and analyze the results. The analysis can take the form of statistical correlations or a machine learning predictive model (or both).
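The course does not prescribe a format for the life-events timeline, so the following is only one possible layout. The field names (`type`, `start`, `end`, `note`) and all of the entries are invented for illustration. The small validator checks that dates are ISO formatted and that no event ends before it starts.

```python
import json
from datetime import date

# Hypothetical schema and made-up entries; adapt both to your own project.
events = [
    {"type": "travel", "start": "2025-03-08", "end": "2025-03-16", "note": "trip home"},
    {"type": "exam", "start": "2025-03-05", "end": None, "note": "midterm"},
    {"type": "illness", "start": "2025-03-20", "end": "2025-03-22", "note": "cold"},
]

def validate(events):
    """Check dates and return the events sorted by start date."""
    for event in events:
        start = date.fromisoformat(event["start"])
        if event["end"] is not None and date.fromisoformat(event["end"]) < start:
            raise ValueError(f"event ends before it starts: {event}")
    return sorted(events, key=lambda event: event["start"])

timeline_json = json.dumps(validate(events), indent=2)
print(timeline_json)
```

Keeping dates as ISO strings means they sort correctly as text and are easy to align with timestamps from your other data sources.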

To understand how to compute correlations between your data streams, ask Dr. ChatGPT, "How do I compute time series correlations for discrete events?"
A machine learning predictive model would use data from one or more sources to predict events in another source. You can use a Python library such as scikit-learn to create the model and compute its accuracy. If you are working on the project by yourself, you will need to separate your data into a training set and a testing set. If you are working in a team, you can try training on one person's complete data and then using the model for prediction using the other person's data. Use cross validation to make your results more reliable.
A challenge for using machine learning for this project is the relatively small amount of data you will have for yourself. You might want to find a way to leverage public databases of social media data to create a larger training set (where you could give higher weight to your own data by repeating it several times). One collection of social media posts is the Stanford Large Network Dataset Collection (SNAP). Another approach may be to use a pretrained large language model such as ChatGPT to do what is called "zero-shot" learning.
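For the statistical route, one common approach (an assumption here, not the only valid one) is to bin each stream of discrete events into daily counts and then compute the Pearson correlation between the two resulting series. The helper names and sample dates below are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta
from math import sqrt

def daily_series(days, start, n_days):
    """Count how many items fall on each of n_days consecutive days from start."""
    counts = Counter(days)
    return [counts[start + timedelta(days=i)] for i in range(n_days)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)

# Made-up data: dates of late-night searches and dates marked as exam days.
start = date(2025, 3, 1)
search_days = [date(2025, 3, 1)] + [date(2025, 3, 2)] * 3 + [date(2025, 3, 4)] * 2
exam_days = [date(2025, 3, 2), date(2025, 3, 4)]

searches = daily_series(search_days, start, 4)  # [1, 3, 0, 2]
exams = daily_series(exam_days, start, 4)       # [0, 1, 0, 1]
print(pearson(searches, exams))
```

To look for delayed effects, you could shift one series by a day or more before correlating. With only a few months of data, remember that a high correlation can easily arise by chance.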

Write a report of at least 2500 words discussing your data sources, hypotheses, methods, results, and insights gained. Place the code for your project in a public GitHub repository, and include the URL for the repository in your report. Be prepared to give a 5 minute talk about your final project. You will use your own laptop for the talk; we will not be gathering slides.

Final project talks will be given on Monday April 21, Wednesday April 23, and Monday April 28. You must be prepared to give your talk by the first date, Monday April 21.

Reports should be turned in online by Sunday May 4 at 11:59pm. Include the names of all members of your team in the filename and on the first page of the report itself.

Academic Honesty

Using AI tools to directly create the paper summaries will be taken as academic dishonesty. You may, however, use an AI such as ChatGPT to help explain concepts from the paper you don't understand, but in such cases include a note in your summary saying how the AI helped you. You may use code in your projects that you find in public repositories, but such use must be cited in your report. All cases of suspected academic dishonesty will be referred to the UVA Honor Office.

Grading

Grades will be based on

25% written paper summaries. These should be turned in at the class in which the paper is being discussed. Hardcopy is preferred. If you cannot attend in person, you may use the Canvas turn-in feature to turn in the summary. Summaries may not be turned in late. Students are forgiven up to two missing summaries during the semester.
25% for your paper presentation.
50% on programming projects and reports. Turning in an assignment late will result in a 25% penalty.