Invented by Olivia Choudhury, Aris Gkoulalas-Divanis, Theodoros Salonidis, Issa Sylla, International Business Machines Corp
The International Business Machines Corp invention works as follows. A computer-implemented method for training a global federated model using an aggregator server includes training local models on local nodes. Each local node selects a set of attributes from its own training dataset to train its local model. Each local node also creates an anonymized training dataset using a syntactic anonymization technique: it selects the quasi-identifying attributes from the training attributes and generalizes them using a syntactic algorithm. Each local node further computes a syntactic mapping based on the equivalence classes produced in its anonymized training dataset. The aggregator computes the union of all mappings received from the local nodes. The federated model is then trained iteratively: the local nodes send parameter updates computed over their local models to the aggregator, and the aggregator server aggregates all received parameter updates and sends the aggregated parameters back to the local nodes.
Background for Anonymizing Data for Preserving Privacy During Use for Federated Machine Learning
The present invention relates generally to computer technology and, more specifically, to anonymizing data used to train a federated machine learning system.
Deep learning, also known as deep structured learning or hierarchical learning, may be based on artificial neural networks. Deep learning architectures such as deep neural networks (DNN), recurrent neural networks (RNN), convolutional neural networks (CNN), and deep belief networks have been used, and are still being used, across many fields, including computer vision and speech recognition. These machine learning techniques analyze data in a way similar to (or sometimes better than) a human. Such techniques first require large volumes of data to train the deep learning machines/architectures to be able to "think" like a human.
According to one or more embodiments, a computer-implemented method for federated learning is described. The method involves training a global learning model using an aggregator server and local models that correspond to local nodes. The local models are trained at the local nodes as part of the operations for training the global learning model. To train a first local model, a first node receives a training dataset containing multiple attributes associated with multiple records. Training the first local model also includes selecting, by the first node, a set of attributes from the training dataset to be used to train the first local model that corresponds to the first node. In addition, the first node generates an anonymized version of the training dataset. This anonymization is performed using a syntactic anonymization technique. Syntactic anonymization involves selecting the quasi-identifying attributes from the set of attributes that will be used to train the first local model. Syntactic anonymization also includes generalizing the quasi-identifying attributes using a syntactic algorithm. The syntactic anonymization further includes computing a first mapping based on the equivalence classes in the anonymized dataset. The federated method also includes each local node sending its mapping to the aggregator server, where each mapping is computed using the equivalence classes in that local node's anonymized dataset. The federated method also includes the aggregator computing a union of the multiple mappings received from the local nodes. The federated method also includes training the global model iteratively: the local nodes train the global model by applying machine learning algorithms to their anonymized datasets, and send the parameter updates computed from the local models to the aggregator server.
The aggregator server computes the parameters for the global federated model after aggregating the multiple parameter updates received from the local nodes. Training the global federated model also includes sending the aggregated parameters from the aggregator back to the local nodes.
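The iterative training flow described above can be sketched as a simple round-based loop. This is a minimal illustration, not the patent's implementation: the one-parameter "model", the gradient-style local step, and the function names (`local_update`, `federated_round`) are assumptions made for clarity.

```python
def local_update(global_param, local_data, lr=0.1):
    """One round of local training: nudge the parameter toward the local mean.

    Stands in for a real gradient step on a node's (anonymized) local dataset.
    """
    grad = global_param - sum(local_data) / len(local_data)
    return global_param - lr * grad

def federated_round(global_param, node_datasets):
    """Each node trains locally and sends an update; the aggregator averages
    the received updates and the result is broadcast back to the nodes."""
    updates = [local_update(global_param, d) for d in node_datasets]
    return sum(updates) / len(updates)

param = 0.0
datasets = [[1.0, 2.0, 3.0], [5.0, 7.0]]   # local data at two nodes
for _ in range(100):
    param = federated_round(param, datasets)
# param converges toward the average of the local means: (2.0 + 6.0) / 2 = 4.0
```

Note that only the scalar updates cross node boundaries in this loop; the raw `datasets` never leave their nodes, which mirrors the data-stays-local property of federated learning.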
Embodiments of the present invention may include a system, a computer program product, or a machine that performs one or more of the described operations.
Embodiments of the invention enhance privacy and security in a federated learning system. In addition, embodiments improve the accuracy of the global machine learning model when compared with federated approaches that use non-syntactic protocols such as differential privacy. Embodiments of the invention also have a significantly lower computational and communication cost than existing approaches that use cryptographic protocols to protect data privacy. Further, technical solutions provided by one or more embodiments of the present invention facilitate compliance with privacy regulations that impose requirements around adequate data de-identification/data anonymization. The embodiments of the invention also lower the infrastructure costs of maintaining data in order to participate in federated learning, as anonymized data does not require secure storage. The anonymized datasets at each local site/node can also be reused, making them a valuable resource that supports multiple types of analysis in addition to their use in federated learning. Embodiments further reduce the need for strict firewall rules at local sites/nodes, which lowers infrastructure costs even more.
The global ML models can achieve high prediction accuracy due to the large amount of data available from multiple sites.
Technical features and benefits not previously available can be realized using the techniques of this invention. The invention, including all aspects and embodiments, is described in detail below; refer to the drawings and the detailed description for a better understanding.
Federated learning (FL) is a machine learning technique that trains a global machine learning model using data from multiple sites without moving the data. The process involves training local models, computing weight updates on the local models, sending the updates to an aggregator server, and updating the global model to be shared across the sites. Federated learning can be used to train any machine learning algorithm, for example deep neural networks (DNN), using local datasets from multiple sites. The multiple sites (or "local nodes") and the aggregator server exchange parameters, e.g., the weights of a neural network, on a regular basis to build the global machine learning model.
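The aggregator's role in this parameter exchange is commonly realized as federated averaging: a dataset-size-weighted mean of the parameter vectors received from the nodes. A minimal sketch (the function name `fed_avg` and the use of size weighting are illustrative, not taken from the patent):

```python
def fed_avg(updates, sizes):
    """Aggregate local parameter vectors by a dataset-size-weighted mean.

    updates: list of parameter vectors (one list of floats per local node)
    sizes:   number of training records at each node, used as weights
    """
    total = sum(sizes)
    dim = len(updates[0])
    return [
        sum(w[i] * n for w, n in zip(updates, sizes)) / total
        for i in range(dim)
    ]

# Two nodes report updated weights; node 1 holds twice as much data,
# so its update counts twice as much in the weighted mean.
global_weights = fed_avg([[1.0, 0.0], [4.0, 3.0]], sizes=[200, 100])
# → [2.0, 1.0]
```

The aggregated `global_weights` would then be sent back to the nodes as the starting point for the next round.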
The global machine learning model is trained to perform a predictive analysis: it provides a prediction for new data samples. In a healthcare application, for example, the global machine learning model can be trained to predict medical diagnoses and/or prescriptions when given training data on a user's background, including demographics and past diagnoses. In an e-commerce application, the machine learning model can be trained to predict the probability that a user will purchase an item based on their demographic data and the number of times they have visited a webpage or used an online application. These are merely examples; in some embodiments, different training data may be used to perform different predictive analyses.
It should be noted that a central server can act as an aggregator for federated learning to control or coordinate one or more steps of the algorithm. Alternatively, the local nodes can be trained peer-to-peer, without a central server; instead, they communicate with each other to coordinate the steps of the federated learning algorithm.
For instance, federated learning can be used to generate a machine learning model for analyzing health data. For health applications, a large number of federated sites (100+) may not be available; in this case, federated learning relies on the data of sites such as hospitals and/or agencies. Each site on its own may also not have enough data to apply deep learning models. Federated learning uses data from several such sites to create one or more machine learning models.
An important technical challenge in federated learning is maintaining privacy while learning the global machine learning model. This requires protecting privacy at each local node when it shares updates with other local nodes and/or a central server. Even if the raw data from a local node is never shared, privacy attacks can still be mounted, for example during the gradient exchange. Existing privacy-preserving techniques in federated learning are based on secure aggregation and differential privacy protocols. Differential privacy techniques suffer from low utility (model accuracy) due to the excessive noise added to the model parameters in order to provide privacy, and they are not compliant with regulatory frameworks such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). Existing secure aggregation protocols are limited in their ability to protect data from an untrusted server and only allow certain types of computations. Under existing protocols, the aggregated data can be manipulated to identify the data associated with an individual user (subject). Existing secure aggregation protocols also have a high computational overhead because of the expensive cryptographic operations they use.
Embodiments of the invention offer technical solutions for addressing such challenges in protecting data privacy when operating federated learning systems. The embodiments therefore provide improvements to computing technology, especially federated learning systems. Embodiments also provide a practical solution for protecting data that is transferred/shared among multiple local nodes to train a federated learning system and to generate a global machine learning model. In fields such as healthcare and banking in particular, protecting data privacy, especially when personal data is stored locally at nodes, is essential; without it, federated learning systems could not operate, due to various privacy and regulatory concerns.
One or more embodiments facilitate the use of syntactic methods for privacy-preserving federated learning. As described herein in detail, a syntactic method removes direct identifiers, protects potentially linkable quasi-identifiers (QIDs), and leaves the non-QID values in place in each local node's data that will be used to train the federated machine learning model. Because less noise is introduced than with differential privacy, the model accuracy is much higher than with existing techniques. The syntactic methods described here are compliant with privacy regulations such as HIPAA and GDPR. The syntactic methods can also be applied differently by the different local nodes of the federated learning system, and can be used at the local nodes to process different types of information, for example relational data or transaction data. In this way, the invention allows syntactic anonymization to be applied in a federated learning system to protect data that is shared between its various components.
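A minimal sketch of this local anonymization step: direct identifiers are dropped, quasi-identifiers are generalized (age into a decade range, ZIP code truncated to a prefix), and non-QID values such as diagnoses are left in place. The specific generalization rules, attribute names, and the k-anonymity check are illustrative assumptions, not the patent's exact algorithm.

```python
from collections import Counter

DIRECT_IDENTIFIERS = {"name", "phone"}          # removed outright
QUASI_IDENTIFIERS = {"age", "zip", "gender"}    # generalized

def generalize(record):
    """Generalize one record's quasi-identifiers; keep non-QID values as-is."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"       # e.g. 34 -> "30-39"
    out["zip"] = record["zip"][:3] + "**"       # keep only a 3-digit prefix
    return out

def is_k_anonymous(records, k):
    """Every combination of QID values must appear in at least k records."""
    groups = Counter(
        tuple(r[q] for q in sorted(QUASI_IDENTIFIERS)) for r in records
    )
    return all(count >= k for count in groups.values())

rows = [
    {"name": "A", "phone": "555-1", "age": 34, "zip": "10023",
     "gender": "F", "diagnosis": "flu"},
    {"name": "B", "phone": "555-2", "age": 36, "zip": "10027",
     "gender": "F", "diagnosis": "asthma"},
]
anon = [generalize(r) for r in rows]
# Both rows fall into the same equivalence class ("30-39", "F", "100**"),
# so the anonymized table is 2-anonymous; the diagnosis values are untouched.
```

Each group of records sharing the same generalized QID values forms an equivalence class, which is exactly the structure the nodes later summarize in the mappings they send to the aggregator.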
Implementing a syntactic method in an FL setting raises a number of technical challenges, which arise from the need to coordinate anonymization between sites during and after FL training. One or more embodiments address these technical challenges by providing technical solutions that add syntactic anonymization to the FL training process at the local sites. The syntactic anonymization of the original data is performed at each local node, and the anonymized datasets are then used to train the global FL model. The anonymization is performed on data records that can include a variety of data types, such as sequential data or location data. Such data can include both a relational and a transactional component, for example patient demographics and diagnoses. This offers protection from adversaries whose knowledge of individuals spans these data types. The technical solutions offered by one or more embodiments of this invention also facilitate a global anonymization mapping process, which aids in making predictions using anonymized data.
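One way to picture the global anonymization map: each node summarizes its anonymized dataset as the set of generalized equivalence-class keys it produced, and the aggregator takes the union of these sets so that new inputs can be generalized consistently at prediction time. Representing a mapping as a set of QID-value tuples is an assumption made for illustration.

```python
def equivalence_classes(anon_records, qids):
    """The distinct generalized QID combinations in a node's anonymized data."""
    return {tuple(r[q] for q in qids) for r in anon_records}

def union_of_mappings(mappings):
    """Aggregator-side union of equivalence classes received from all nodes."""
    out = set()
    for m in mappings:
        out |= m
    return out

QIDS = ("age", "zip")
node1_map = equivalence_classes(
    [{"age": "30-39", "zip": "100**"}, {"age": "40-49", "zip": "100**"}], QIDS)
node2_map = equivalence_classes(
    [{"age": "30-39", "zip": "100**"}, {"age": "20-29", "zip": "941**"}], QIDS)
global_map = union_of_mappings([node1_map, node2_map])
# Three distinct classes across both sites; the overlapping class
# ("30-39", "100**") appears only once in the union.
```

Because only generalized class keys (not record-level values) are sent, the aggregator learns how each site generalized its data without seeing the underlying records.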
FIG. 1 shows a federated learning system 100 in accordance with one or more embodiments. The system 100 includes an aggregator 110 in communication with local nodes 120A, 120B, 120C at respective sites. The aggregator 110 and the local nodes 120A, 120B, 120C are computing systems, which can be the same or different at each site. In one or more embodiments of the invention, the aggregator 110 can be viewed as another site, where the local node at that site is designated to aggregate the data from the local nodes 120A, 120B, 120C. While only three local nodes (120A, 120B, 120C) are shown in this example, the federated learning system 100 may include a different number of local nodes, for example 2, 10, or 15.
Each local node 120A, 120B, 120C has its own local data D1, D2, D3. The local data D1, D2, D3 can be of any data type accumulated by the local nodes 120A, 120B, 120C for further analysis. The local data D1, D2, D3 can include healthcare data, banking data, e-commerce transactional data, online transactional data, or other types of data. It should be appreciated that although the local data D1, D2, D3 at the respective local nodes 120A, 120B, 120C are all from the same domain, they can be accumulated/collected and stored in different ways. The data could, for example, be a single type of data (e.g., relational data or transaction data). The data may also include more than one data type (e.g., relational data with transaction data, or relational data with user trajectories, etc.).
Furthermore, each local node 120A, 120B, 120C can use a different database system or file system to store the data. Each local node 120A, 120B, 120C can also have different types (more or fewer) of parameters. As an example, local node 120A may have accumulated, as part of local data D1, a user's data including the user's name, gender, age, zip code, phone number, medical conditions, and prescriptions. Local node 120B, in turn, may have accumulated, as part of local data D2, a user's data including the user's age, gender, and medical conditions. In this example it can be seen that the two datasets differ in the parameters they contain, and also in the order in which they are stored. Other embodiments may have additional differences.
In one or more examples, the local data D1, D2, D3 are preprocessed to ensure that each record of the data transferred and shared in federated learning contains the data of a unique individual user. The shared data is represented by a set of attributes; these attributes can be common to all sites, or each site's data may have additional attributes.
It should be understood that, although the examples herein describe data records for individual users, other embodiments of the invention can use data records for other entities. The examples use personal data for health applications, i.e., the kind of data against which sites or the aggregation server could launch inference attacks.
In one or more embodiments, the local data D1, D2, D3 at each site are used to train a local machine learning (ML) model at the respective site. The local models 122A, 122B, 122C may be created/trained at each local node 120A, 120B, 120C using respective machine learning techniques (CNN, RNN, combinations thereof, etc.).