

Social Media and Vulnerable Populations

HIV social network social media twitter


Camille Nebeker

Posts: 56

Any recommendations for how to collect and store publicly available data from social media in the context of sensitive populations such as HIV-infected people?

Nadir Weibel

Posts: 5

In our research we are collecting data from Twitter posts that are publicly available. We used the Twitter APIs to collect the data and store it on our secure servers. Although the data is publicly available on Twitter, our research tries to make inferences about the risk of contracting or transmitting HIV by analyzing the content of the tweets, as well as the social network around someone who is tweeting about different kinds of risk behaviors (e.g., drug abuse, specific sexual behaviors, particular at-risk venues). We therefore wanted to make sure that the collected data is properly anonymized and stored.

After long discussions with the IRB, we ended up specifying our protocol around a few guidelines, which I paste below from our approved IRB protocol at UC San Diego.

All data collected will be linked to a unique ID, which will be the only way of tracking a participant within our database. Of note, not only will study participants have unique IDs, but any third parties identified through private tweets will also be linked immediately to a unique ID. Additionally, the content of the tweets will be redacted to replace any direct reference to another user through the @ sign with the unique ID assigned to that user. We will redact URLs directly integrated in the tweets and, after classifying them, we will apply specific “Named Entity Recognition” Natural Language Processing filters to attempt to remove Personally Identifiable Information (PII), such as the MITRE approach ( Additionally, in an effort to limit content reconstruction and linkage to the original users who posted the message, we will remove stop words and punctuation from tweets, convert them to lowercase, and apply word-stemming techniques. Finally, all collected geographic information based on latitude and longitude will be abstracted to the 5-digit ZIP code level.
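To make the text-redaction steps above more concrete, here is a minimal Python sketch. It is not the code used in the study. The `Pseudonymizer` class, the tiny `STOP_WORDS` set, and the `naive_stem` helper are all hypothetical stand-ins chosen for illustration; a real pipeline would likely use a fuller stop-word list and an established stemmer, and the sketch does not attempt the Named Entity Recognition step or the ZIP code abstraction.

```python
import re
import uuid

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "on", "at",
              "is", "are", "was", "i", "my", "with"}

MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


class Pseudonymizer:
    """Maps each Twitter handle to a stable, random study ID (hypothetical helper)."""

    def __init__(self):
        self._ids = {}

    def study_id(self, handle: str) -> str:
        key = handle.lower()  # treat handles as case-insensitive
        if key not in self._ids:
            self._ids[key] = "u" + uuid.uuid4().hex[:12]
        return self._ids[key]


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def redact_tweet(text: str, pseudonymizer: Pseudonymizer) -> str:
    tokens = []
    for raw in text.split():
        # Drop any token that contains a URL.
        if URL_RE.search(raw):
            continue
        # Replace an @mention with the study ID; the ID is kept verbatim
        # so that later normalization cannot alter it.
        mention = MENTION_RE.search(raw)
        if mention:
            tokens.append(pseudonymizer.study_id(mention.group(1)))
            continue
        # Lowercase and strip punctuation.
        word = re.sub(r"[^\w\s]", "", raw.lower())
        # Remove stop words, then stem what is left.
        if word and word not in STOP_WORDS:
            tokens.append(naive_stem(word))
    return " ".join(tokens)
```

For example, `redact_tweet("Meeting @Bob at the club tonight! http://t.co/abc", Pseudonymizer())` yields the stemmed word `meet`, the study ID assigned to `Bob`, and then `club tonight`. Because the same `Pseudonymizer` instance returns the same ID for the same handle, the social network structure is preserved while the handles themselves are not stored in the redacted text.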

In addition to data collection, our local IRB also required us to specify the security protocols in place, as follows:

Any data collected as part of this study that is stored and/or is transferred via the internet will follow our data security process as outlined below. With the fast-developing technology, dependable and comprehensive data security measures are key components to defy the perceived threats of Internet hackers and accidental disclosure of confidential information. In the following we provide a summary of the key features pertinent to this project.

- A confidential participant identification number is used for all data collection, recording, and submission to the project database. 
- Data that contain any participant identifiers (e.g., name or contact information) other than the unique identifier are password protected and accessible only to staff members whose job requires knowledge of such data. 
- Research staff are instructed not to disseminate any participant identifiers in any communications with, or data submissions to, any other collaborators. Any data transfer over the Internet uses encryption. 
- Data transfer and all Web-based utilities use secure access (user and server authentication, 128-bit SSL encryption). This type of encryption is the same as is used for Web-based transactions that involve credit cards or Web banking. 
- The server infrastructure will be set up to minimize any intrusion and potential loss of data. In particular, hard drives will be encrypted; firewalls will follow a DROP default policy, only allowing access to the services that need to be exposed on the network (such as the web server); HTTPS and SSL security will be set to the strongest possible standard (an “A” rating in the Qualys SSL Labs report), including disabling SSL3 access, as well as 1-step SSL certificate download; application services will be separated using secure virtualization software such as Docker; and dedicated AppArmor profiles tailored to the specific services and requirements of our application will be defined on the server.
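As one illustration of the "disabling SSL3 access" item above, the following Python sketch builds a server-side TLS context that refuses anything older than TLS 1.2. This is an assumption about how such a policy could be expressed with Python's standard `ssl` module; the protocol text does not say which web server or language was used, and the function name `make_server_tls_context` is hypothetical.

```python
import ssl
from typing import Optional


def make_server_tls_context(certfile: Optional[str] = None,
                            keyfile: Optional[str] = None) -> ssl.SSLContext:
    """Build a server-side TLS context that rejects SSLv3 and early TLS versions."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Assumed policy: TLS 1.2 as the floor, which also excludes SSLv3.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if certfile:
        # Load the server certificate and private key when paths are supplied.
        ctx.load_cert_chain(certfile, keyfile)
    return ctx
```

The returned context could then be passed to a Python server that accepts an `ssl.SSLContext`. The firewall, disk encryption, Docker, and AppArmor items are operating-system configuration and are not covered by this sketch.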

Nadir Weibel

Posts: 5
posted in reply to nadir weibel

In phase 2, we also recruited participants from our local cohort and tracked their Twitter accounts specifically, to better understand their risk behavior. This proved to be particularly challenging in terms of human subjects protection.

Our protocol was again negotiated with the IRB, which finally approved the following language describing how we planned to interact with cohort research participants:

After appropriate informed consent (see attached), participants will be requested to provide us with demographic information (age, gender, zip code, sexual orientation and sex and drug practices), and also to allow us access to their private Twitter accounts. We will build a simple Twitter application that will allow enrolled users to easily give us access through the application to key information from their Twitter posts. Our Twitter application will generate a user-based authentication token that will allow us to access their posts with a “user context”, i.e. seeing everything the participant sees; this will include social connections as well as geo endpoints (see

Additionally, we specified data collection procedures in relation to our Twitter application:

Subjects will be asked to install a free Twitter application in their Twitter account. The application will not create any new data, and all existing data remains the sole property of the profile owner, which adheres to Twitter’s policy on data ownership. The application itself will be owned and maintained by the co-PIs, and it will be active for as long as the co-PIs have permission from the IRB to keep it active. The application will be deployed on our servers, and its sole role will be to link the recruited participants’ Twitter profiles to our study. The installation process will therefore be limited to associating the subject’s Twitter account through the generation of an access token. The access token will be stored securely on our server, and nothing will be installed locally on the computer or the smartphone of the participants. Our application will be listed in the “Apps” list of the Twitter account’s settings, and participants will be able to revoke access to their account at any time by clicking the “revoke” button associated with the application. This will also include temporary access revocation for short periods of “private” activity, if participants decide they don’t want to share these particular data. The “revoke access” functionality can be deactivated by selecting the Twitter “Undo Revoke Access” button for the specific application.

As part of the installation phase, participants will be presented with information about the study:
 “This application is part of a scientific study carried out at the University of California San Diego (UCSD), investigating the structure of social networks to inform HIV prevention. If you use this application, you consent to participate in this study. Your demographic information, as well as information about your social network (friends) and your interaction with them will be collected, anonymized and stored on a secure server at UCSD. Your data will always remain confidential and may be studied by researchers or used in scientific publications. If you do not consent to participate in this study, please do not use this application. For more information or detailed description of the study, please go to or contact the researchers by email at”

By proceeding with the installation of the application they will consent for the application to collect data as described below. As part of the installation process we will also provide the participant agreeing to participate in the study with a link to the consent form.
After consenting, subjects will be asked to authenticate the application to access Twitter information about the subject. We will first confirm that the participant has the amount of Twitter activity required by the inclusion criteria, then collect initial information about the participant, and then collect the time and type of interactions with entities published on the participant’s Twitter account. The initial information about the participant will include their basic profile information (Twitter name, time zone, location, website). By using the list of followers and followees retrieved from the participant’s account we will then collect Twitter interactions from published tweets and re-tweets. The PIs will have access to only the consented users’ sent protected tweets and recipients’ user IDs. Any private tweets that are not generated by the study subject will not be available to researchers. Private tweets shall be designated as private in the collected research data.
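The rule in the last three sentences above (keep protected tweets only when a consented participant wrote them, and flag them as private) can be sketched as a small filter. This is an illustration only: the `TweetRecord` fields and the `filter_collected` function are hypothetical names, not the Twitter API's data model or the study's actual code.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Set


@dataclass(frozen=True)
class TweetRecord:
    """Hypothetical minimal record for a collected tweet."""
    author_id: str            # study ID of the account that sent the tweet
    text: str
    protected: bool           # True if the tweet comes from a protected account
    is_private: bool = False  # set by filter_collected, not by the collector


def filter_collected(tweets: Iterable[TweetRecord],
                     consented_ids: Set[str]) -> List[TweetRecord]:
    """Drop protected tweets from non-consented authors; flag the rest."""
    kept = []
    for tweet in tweets:
        if tweet.protected and tweet.author_id not in consented_ids:
            # Private tweet not generated by a study subject: never stored.
            continue
        # Protected tweets from consented users are kept and marked private.
        kept.append(replace(tweet, is_private=tweet.protected))
    return kept
```

Applying the filter before anything is written to the research database means that third-party private content never reaches storage, rather than being removed afterwards.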

Attached is the consent form approved by our IRB.

Camille Nebeker

Posts: 56

This is great information, Nadir! Thanks for sharing on the CORE Forum! It would be great to have our CORE Network members chime in and rate the standards that you have noted. That is, would the research community (e.g., researchers, IRBs, participants) consider this a 'best practice' for other researchers to follow?

Marta Jankowska

Posts: 3

The IRB language is very helpful, but the implementation seems incredibly cumbersome and complex. I'm assuming you developed algorithms to clean the Twitter data as you described in the IRB language. Did this take long, and how complex or cumbersome was it to develop the necessary data cleaning protocols?