CS 4501 / 6501 Analyzing Online Behavior for Public Health
Prof. Henry Kautz <henry.kautz@virginia.edu>
Monday & Wednesday 3:30-4:45pm
Mechanical Engr Bldg 341
People’s online behavior contains signals about their physical and mental health. This course will explore research on using data from users’ interactions with Twitter/X, Google Search, YouTube and other online platforms for tasks ranging from identifying people suffering from anxiety disorder to tracking down restaurants that are sources of food poisoning. We will also read papers on both sides of the ongoing debate about whether social media should be restricted because of potential harm to children or adults.
Prerequisite: CS 2100 or permission of the instructor.
Students will be required to read up to 4 research papers each week and write a one-page summary of each. These summaries must be written by the students themselves, without using any AI writing or summarization tools. In addition, for each class session, two students (or, depending on class enrollment, two pairs of students) will present their summaries and lead a discussion of the work. The written summaries and presentations should include:
The research hypothesis of the work
The methods employed
A critical discussion of the results, including strengths and weaknesses
Ethical concerns
Open questions raised but left unaddressed by the work
The presentations can make use of PowerPoint or other media at the option of the presenters. The written summaries should be turned in using UVA Canvas.
The course includes programming projects in which students will download and analyze part of their own online footprint. The easiest data for students to access is their Google Takeout data, which includes their browsing history, location history, search history, and YouTube viewing history.
Students who already use Google services should consider turning on the “Timeline” feature of Google Maps immediately upon registering for the course (if it is not already on) in order to ensure that they have a rich set of location data to analyze. Students who do not use Google services should contact the professor to talk about what other kinds of online data about themselves they could access.
Please note that students’ data will not be shared with the professor or other students; they will only be expected to include summaries and visualizations of the data that they create themselves in their project reports.
You are encouraged to make use of large language models such as ChatGPT for help with coding. You are also free to make use of public code repositories. All such use of LLMs or public repositories should be cited in your report.
Both CS 4501 and CS 6501 students will complete the first two projects:
Use your Google Takeout data to infer your sleep patterns for at least a month, under the assumption that these can be determined by your lack of use of Google applications. Create visualizations of the data. Write a report of at least 1500 words discussing your methods, results, and insights gained.
Use your Google Takeout Map Timeline data to infer the significant locations in your life, that is, places you repeatedly visit over the course of a month, including your home, your classrooms, favorite restaurants, and similar. Write rules that automatically label the types of significant locations based on features of those locations that you can obtain using the Google Places API, and include in your report how accurate your rules are. Create visualizations of the data. Write a report of at least 1500 words discussing your methods, results, and insights gained.
Only CS 6501 students are required to complete a third project; it is optional for CS 4501 students.
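As a starting point for the sleep-inference project, the idea of reading sleep from a lack of activity can be sketched as follows. This is only one possible approach under stated assumptions, not a required method: the function names `parse_takeout_time` and `infer_sleep` and the 4-hour threshold are illustrative choices, and the sketch assumes that each activity record in your Takeout export carries an ISO 8601 UTC timestamp string (check your own export, since formats can differ). The sample timestamps are made up.

```python
from datetime import datetime, timedelta


def parse_takeout_time(text):
    """Parse an ISO 8601 timestamp such as '2024-03-01T23:45:00.000Z'.

    Assumption: activity records store UTC times in this form. Convert to
    your local time zone before interpreting the result as sleep hours.
    """
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def infer_sleep(timestamps, min_gap=timedelta(hours=4)):
    """Map each wake-up date to the longest inactivity gap ending that day.

    A gap is the time between two consecutive activity events. Gaps shorter
    than min_gap are ignored as ordinary breaks rather than sleep.
    """
    events = sorted(timestamps)
    sleep = {}
    for start, end in zip(events, events[1:]):
        gap = end - start
        if gap < min_gap:
            continue
        wake_day = end.date()
        best = sleep.get(wake_day)
        if best is None or gap > best[1] - best[0]:
            sleep[wake_day] = (start, end)
    return sleep


# Made-up activity times (already in local time) for illustration only.
events = [
    datetime(2024, 3, 1, 22, 30),
    datetime(2024, 3, 1, 23, 45),
    datetime(2024, 3, 2, 7, 10),
    datetime(2024, 3, 2, 12, 0),
    datetime(2024, 3, 2, 17, 30),
    datetime(2024, 3, 2, 23, 15),
    datetime(2024, 3, 3, 8, 0),
]

sleep = infer_sleep(events)
for day, (asleep, awake) in sorted(sleep.items()):
    print(day, "asleep about", asleep.time(), "awake about", awake.time())
```

Taking the longest gap per wake-up date is a deliberate simplification: daytime stretches without Google activity can also exceed the threshold, so your report should discuss how you handle such cases (for example, by restricting gaps to plausible night hours).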
Create a research hypothesis about the relationship between at least two different kinds of personal online data and events in your life. The online data can all come from Google Takeout or can include data from other sources such as Instagram. For events in your life, manually create a timeline in JSON for a three-month period that includes events such as the beginning of classes, exams, travel, illnesses, and other significant events during that period. Design and carry out an experiment to test your hypothesis and analyze the results. The analysis can take the form of a summary of statistical correlations or a machine learning predictive model (or both):
To understand how to compute correlations between your data streams, ask Dr. ChatGPT, "How do I compute correlations between time series data?" and "How do I compute time series correlations for discrete events?"
A machine learning predictive model would use data from one or more sources to predict events in another source. You can use a Python library such as scikit-learn to create the model and compute its accuracy. Use cross-validation to make your results more reliable.
Write a report of at least 2000 words discussing your methods, results, and insights gained.
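To make the correlation option for the third project more concrete, here is a minimal sketch that turns a hand-written JSON timeline into a daily 0/1 indicator and computes a lagged Pearson correlation against a daily activity count. Everything specific here is an assumption for illustration: the JSON field names `date` and `event`, the helper names, the one-week window, and the search counts, which are invented rather than real data.

```python
import json
import math
from datetime import date, timedelta

# Hypothetical timeline format; the field names are illustrative only.
TIMELINE_JSON = """
[
  {"date": "2024-03-04", "event": "exam"},
  {"date": "2024-03-07", "event": "exam"}
]
"""


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def event_indicator(timeline, days, event_type):
    """Return 1 for each day on which event_type occurs, else 0."""
    event_days = {
        date.fromisoformat(entry["date"])
        for entry in timeline
        if entry["event"] == event_type
    }
    return [1 if day in event_days else 0 for day in days]


def lagged_correlation(x, y, lag):
    """Correlate x on day t with y on day t + lag (lag >= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])


days = [date(2024, 3, 1) + timedelta(days=i) for i in range(7)]
searches_per_day = [10, 12, 30, 11, 9, 28, 10]  # invented counts
exam_day = event_indicator(json.loads(TIMELINE_JSON), days, "exam")

same_day = lagged_correlation(searches_per_day, exam_day, 0)
day_before = lagged_correlation(searches_per_day, exam_day, 1)
print(f"same day: {same_day:.2f}, day before exam: {day_before:.2f}")
```

With only a handful of days, any correlation is fragile; over a three-month timeline you would still want to report how many events the estimate rests on and whether the result survives a different choice of lag.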
Using AI tools to create any paper summaries will be taken as academic dishonesty. Passing off someone else’s work as your own without acknowledgement will be taken as academic dishonesty. All cases of suspected academic dishonesty will be referred to the UVA Honor Office.
Grades will be based on
25% on written paper summaries. These must be turned in no later than 48 hours after the class in which the corresponding paper was discussed, or zero credit will be assigned for that summary. Each student may miss up to 2 written summaries without penalty. Students who have planned absences from campus are encouraged to turn in their paper summaries in advance.
25% on paper presentations and leading the class discussions.
50% on programming projects and reports. Turning in an assignment late will result in a 25% penalty.