

Twitter Data Storage



Camille Nebeker

Posts: 56

How should Twitter data be stored if it includes potentially sensitive information and is being used for research purposes?

Rubi Linares-Orozco

Posts: 33

PII is defined as "any information about an individual maintained by an agency, including (1) any information that can be used to distinguish or trace an individual's identity, such as name, social security number, date and place of birth, mother's maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information."

So, for example, a user's IP address is not classed as PII on its own, but it is classified as linked PII (see Section 3.3.3 under “Identifiability” for more detail). If the data contains several identifiers (user handle, geocode, IP address), the data could be traced back to the user profile.

Data that contains PII requires the same security protections as PHI. If personal identifiers are the minimum necessary to conduct the research, it is strongly recommended that you use a secure, HIPAA-compliant server. Typically there is a financial cost associated with keeping data on a HIPAA-compliant server, so I recommend that you take these questions to your IT department for technical solutions and a cost analysis.

An alternative would be to obtain filtered data from Twitter (with PII removed). The dataset would then be provided to the researcher as de-identified (no PII) and therefore would not require the extensive security precautions of PII/PHI.

Another option is to mask the identifiers with a code, so that the dataset is automatically converted before it is stored. For example, you could drop certain pieces of a geocode or IP address, keeping enough detail to answer the research questions but not enough to identify the user or re-link the data to them.
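
To make the masking idea concrete, here is a rough Python sketch of how records might be converted before they are stored. Everything in it is an assumption for illustration: the field names (`handle`, `ip`, `lat`, `lon`, `text`), the function names, and the choice to keep a /24 IPv4 prefix and one decimal place of latitude/longitude are hypothetical, and the right level of coarsening depends on your research questions and your IRB's guidance.

```python
import hashlib
import hmac
import ipaddress
import secrets
from typing import Tuple

# Hypothetical per-project key. If it is discarded after processing,
# the codes below cannot be recomputed to re-link handles to users.
SECRET_KEY = secrets.token_bytes(32)


def mask_handle(handle: str, key: bytes = SECRET_KEY) -> str:
    """Replace a user handle with a keyed hash code (HMAC-SHA256, shortened)."""
    digest = hmac.new(key, handle.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def truncate_ip(ip: str) -> str:
    """Drop the host part of an IP address (keep /24 for IPv4, /48 for IPv6)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def coarsen_geocode(lat: float, lon: float, decimals: int = 1) -> Tuple[float, float]:
    """Round coordinates so they describe an area rather than a point."""
    return (round(lat, decimals), round(lon, decimals))


def mask_record(record: dict) -> dict:
    """Return a copy of the record with identifiers masked; raw values are not kept."""
    return {
        "user_code": mask_handle(record["handle"]),
        "ip_prefix": truncate_ip(record["ip"]),
        "geo": coarsen_geocode(record["lat"], record["lon"]),
        "text": record["text"],
    }


if __name__ == "__main__":
    example = {
        "handle": "@example_user",
        "ip": "203.0.113.57",
        "lat": 32.8801,
        "lon": -117.2340,
        "text": "sample tweet text",
    }
    print(mask_record(example))
```

Note that this only masks the structured fields. The tweet text itself can still contain names, handles, or locations, so it may need its own review before the dataset is treated as de-identified.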