Data Privacy and Security at Pathos
Deep learning and neural networks are at the forefront of AI research and are used for predictive analysis across many applications and industries. That predictive power comes at a cost: deep learning models must consume massive amounts of data to fine-tune their predictions during training.
It is often said that a data scientist’s work is dominated by acquiring, labeling, transforming, and feeding data into these models. This is the rule rather than the exception, and it holds true here at Pathos, where we deal with a large amount of data. With such data comes the responsibility of protecting it, ensuring that its use complies with local regulations, and determining which parts of it are useful.
Although Pathos is a young startup, we’ve gone to great lengths to make sure that we operate as responsibly as possible.
Pathos is based in Canada, where data privacy is enforced via the Personal Information Protection and Electronic Documents Act (PIPEDA). Like many government frameworks, PIPEDA provides general guidelines on how to obtain meaningful consent and on which practices are deemed inappropriate. Much of the detail is left to interpretation on a case-by-case basis.
Because Pathos deals with consumer data, much of the data we handle contains natural language text describing what consumers are feeling and their personal comments or opinions on a specific topic, product, or service. To ensure that we observe and respect the privacy of such data, we have implemented the following framework.
We perform inquiry and due diligence to confirm with our clients that the data they intend to provide to Pathos was collected with clear and specific consent.
Additionally, we clearly specify how we intend to use the data, both on the Pathos website and in any agreements with clients who provide us data for our work. By being specific about our intended use, we give clients a clear basis for deciding whether to consent to our use of their data.
1.2. Sensitive Data (PII)
Datasets typically contain a lot of data, and some of it may be deemed sensitive, in other words, personally identifiable information (PII). PII refers to a class of data, such as a person’s full name, address, date of birth, or phone number, that is sensitive and can be used to identify a particular person. At Pathos, we handle massive amounts of data in different formats, and our core guiding principle is that we should a) eliminate all PII-type data whenever possible and b) use PII data sparingly and only with appropriate consent and clearance. We apply what is known as data classification, which determines the sensitivity of each dataset and in turn governs the digital security requirements for storing and using it. Because data arrives in varying formats, we’ve developed an overarching framework to guide us in removing or limiting the PII data that we work with. This is broken down by the formats we come across:
1.2.1. Structured Text Data
These datasets are in a tabular format, similar to a CSV file or Excel spreadsheet, where each row represents a “record” and each column is an attribute or “feature” of a record. When encountering structured data, we apply the following framework to appropriately classify and optionally redact PII-sensitive information.
- Create a Data Classification Document (DCD) by extracting the column headers for each dataset into a separate file (known as a “schema” or “table definition” file). This separate file only contains the table headers and data types (e.g. number, string, etc.) and can be used to validate whether data in each column is sensitive or not.
- Pathos then executes a series of scripts using regular expressions (a programming term applied to pattern-matching activities) to identify columns where PII data is present and notes this accordingly in the DCD. In addition to regular expression matching scripts, we are always researching newer tools to increase the effectiveness and automation of identifying PII-sensitive data in a way that’s replicable and consistent across all clients and datasets.
- Lastly, Pathos takes the DCD back to the client who provided the data and validates whether the identified PII columns are required. If they are not required, we securely remove the columns containing PII data from the dataset and classify the resulting dataset as “Internal” in the DCD. If some PII columns need to be retained, we upgrade the classification to “Internal – PII sensitive” in the DCD and apply the appropriate digital security measures.
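The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the scripts Pathos actually runs: the patterns in `PII_PATTERNS`, the 50% match threshold, and the function names `build_dcd` and `classify` are all hypothetical, and real PII detection would need far broader, locale-aware rules.

```python
import csv
import io
import re

# Hypothetical patterns for illustration only; deliberately narrow.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    "phone": re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "sin": re.compile(r"\d{3}[\s-]?\d{3}[\s-]?\d{3}"),
}


def infer_type(values):
    """Rough type inference: 'number' if every non-empty value parses as a float."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "string"
    return "number"


def build_dcd(csv_text, threshold=0.5):
    """Build one DCD entry per column: header, data type, and detected PII type."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    columns = {h: [] for h in headers}
    for row in reader:
        for h, value in zip(headers, row):
            columns[h].append(value)

    dcd = []
    for header, values in columns.items():
        non_empty = [v.strip() for v in values if v.strip()]
        pii_type = None
        for name, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in non_empty if pattern.fullmatch(v))
            # Flag the column when enough of its values match a PII pattern.
            if non_empty and hits / len(non_empty) >= threshold:
                pii_type = name
                break
        dcd.append({"column": header, "type": infer_type(values), "pii": pii_type})
    return dcd


def classify(dcd, retained_pii_columns):
    """Apply the client's decision on which flagged columns must be kept."""
    flagged = {entry["column"] for entry in dcd if entry["pii"]}
    kept = flagged & set(retained_pii_columns)
    return "Internal – PII sensitive" if kept else "Internal"
```

In this sketch, `classify(dcd, [])` corresponds to the case where the client confirms that no PII columns are needed, so the flagged columns would be removed and the dataset labeled “Internal”.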
1.2.2. Unstructured Text Data
These datasets typically consist of paragraphs of text with no fixed format or length. An example of unstructured data is the text entered in a comments box on a customer feedback form. Detecting sensitive data is more challenging than with structured text data because there are no identifiers, such as column headers, to forewarn of data sensitivity.
As with the above, a Data Classification Document (DCD) is created for each dataset. However, the format of the DCD depends largely on the format of the dataset. In the case of a dataset of tweets, we can develop a unique identifier system to tag each tweet for reference (e.g. “Dataset2-Tweet25” would identify the 25th tweet in the second dataset received). We then execute regular expression matching scripts over these datasets and determine whether any text corresponds to a pattern we recognize as being PII-sensitive (e.g. address, phone number, social insurance number, etc.). When PII-sensitive text is found, the script either removes the sensitive text tokens entirely or replaces them with a redacted version (e.g. <REDACT_ADDRESS> replaces the string “1234 Dunhill Crescent”).
Once completed, Pathos confirms the DCD findings with the client, as with structured data, so that the client knows what PII data was found and what we’ve done to remove or redact it.
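As a rough illustration of the tagging and redaction steps just described, the sketch below assigns each tweet an identifier and substitutes redaction tokens for matched text. The patterns in `REDACTION_PATTERNS` and the function names `redact` and `process_dataset` are assumptions made for this example; they cover only a few simple North American formats and would miss many real-world variations.

```python
import re

# Hypothetical, deliberately narrow patterns; order matters because each
# pattern runs over the output of the previous one.
REDACTION_PATTERNS = [
    ("PHONE", re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")),
    ("SIN", re.compile(r"\b\d{3}[\s-]?\d{3}[\s-]?\d{3}\b")),
    (
        "ADDRESS",
        re.compile(
            r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}"
            r"(?:Street|St|Avenue|Ave|Road|Rd|Crescent|Cres|Drive|Dr|Boulevard|Blvd)\b"
        ),
    ),
]


def redact(text):
    """Replace each recognized PII span with a <REDACT_TYPE> token."""
    found = []
    for label, pattern in REDACTION_PATTERNS:
        text, count = pattern.subn(f"<REDACT_{label}>", text)
        if count:
            found.append((label, count))
    return text, found


def process_dataset(dataset_number, tweets):
    """Tag each tweet with an ID such as 'Dataset2-Tweet25' and redact it."""
    results = {}
    for position, tweet in enumerate(tweets, start=1):
        record_id = f"Dataset{dataset_number}-Tweet{position}"
        redacted, found = redact(tweet)
        # The per-record findings are what would be noted in the DCD.
        results[record_id] = {"text": redacted, "pii_found": found}
    return results
```

Keeping the per-record findings alongside the redacted text means the DCD can report what was detected without retaining the original sensitive strings.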
Whether the data that we at Pathos handle is classified as PII-sensitive, internal non-PII, or public non-sensitive (i.e. datasets freely downloadable by the public), data security is at the forefront of everything we do. Pathos has developed a high-level framework that we’ve begun deploying to our infrastructure and client projects using a layered approach.
2.1. Data Residency and Locale
The first step in securing our data assets is to ensure that the data we’ve sourced from our clients, for which we have obtained clear consent on usage, is stored within our clients’ operating jurisdiction wherever possible.
2.2. Access Controls
Once data residency has been established, Pathos applies a base set of controls to lock down access to digital assets, including our clients’ data and any repositories or workspaces that may host such data.
2.2.1. User Roles
As we are a young organization with limited staff, we’ve taken the first steps in defining a general classification of access rights that we apply to our infrastructure.
These roles are applied throughout our infrastructure and computing assets, whether on-premise or cloud-hosted.
2.2.2. Periodic Reviews
On a predefined schedule, multiple times a year, Pathos conducts an internal access review of all computing assets to enforce the removal of any excessive access that may arise from staff transfers, terminations, and other factors. This ensures that access to every computing asset is kept to a minimum.
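The core of such a review can be expressed as a comparison between the access currently granted and the access each person’s current role justifies. The sketch below is only a hypothetical illustration of that idea: the `Grant` type, the `find_excess_access` function, and the shape of the `active_assignments` mapping are assumptions for this example and do not describe Pathos’s actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One access grant: a user holding access to a computing asset."""
    user: str
    asset: str


def find_excess_access(grants, active_assignments):
    """Return the grants that are no longer justified.

    active_assignments maps each active staff member to the set of assets
    their current role requires. Users missing from the mapping (for
    example, terminated staff) are treated as entitled to nothing.
    """
    excess = []
    for grant in grants:
        allowed = active_assignments.get(grant.user, set())
        if grant.asset not in allowed:
            excess.append(grant)
    return excess
```

Each grant returned would be a candidate for removal, whether it stems from a transfer (access left over from a previous role) or a termination (a user no longer in the roster).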
2.3. Infrastructure Security
In addition to data residency and infrastructure, application, and database access controls, Pathos continuously works to ensure that sensitive assets such as client data and solutions are segregated. To accomplish this, we’ve taken a modular approach in which each client solution or project is hosted on a separate workspace or compute instance, with all related data, applications, and infrastructure kept apart from other projects. This segregation applies even when we undertake multiple projects with the same client.
Data security is central to our operating principles. In building and implementing these security frameworks, Pathos strives to be a responsible digital citizen and to build and maintain trust with all the clients and customers we work with.