A Proposed Model to Allow Data Mining Classification Avoiding Privacy Concerns

Document Type: Original Article

Abstract

Data mining aims to discover hidden facts in databases and data
warehouses. The discovered facts should not reveal information that is
private to individuals or groups. In recent years, privacy concerns have grown
as institutions and online merchants gather increasing amounts of personal
data. There has been increasing interest in the problem of building
accurate data mining models over aggregate data while protecting privacy at the
level of individual records. One approach to this problem is to randomize the
values in individual records and disclose only the randomized values. This method
preserves privacy while still giving access to the information implicit in the
original attributes. The distribution of the original data set is important, and
estimating it is one of the goals of the data mining algorithms.
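
The randomization idea can be illustrated with a minimal sketch. It assumes additive uniform noise and recovers only the mean and variance of the original values, which is a simpler stand-in for full distribution reconstruction and is not the procedure used in this paper. The names `randomize`, `estimate_original_moments`, and `spread`, as well as the synthetic `ages` data, are hypothetical.

```python
import random
import statistics


def randomize(values, spread, rng):
    """Perturb each value with independent uniform noise in [-spread, spread].

    Only the returned (randomized) values would be disclosed.
    """
    return [v + rng.uniform(-spread, spread) for v in values]


def estimate_original_moments(disclosed, spread):
    """Estimate the mean and variance of the original values.

    Assumption: the noise is independent of the data, has mean 0 and
    variance spread**2 / 3 (uniform on [-spread, spread]), so
    Var(disclosed) = Var(original) + spread**2 / 3.
    """
    noise_var = spread ** 2 / 3.0
    mean = statistics.fmean(disclosed)
    # Clamp at zero because sampling error can push the difference negative.
    var = max(statistics.pvariance(disclosed) - noise_var, 0.0)
    return mean, var


# Synthetic example data (hypothetical, for illustration only).
rng = random.Random(0)
ages = [rng.gauss(40, 10) for _ in range(20000)]

disclosed = randomize(ages, spread=30.0, rng=rng)
est_mean, est_var = estimate_original_moments(disclosed, spread=30.0)

print("true mean/variance:     ", statistics.fmean(ages), statistics.pvariance(ages))
print("estimated mean/variance:", est_mean, est_var)
```

Individual disclosed values can differ from the originals by up to `spread`, yet the aggregate estimates stay close to the true ones when the sample is large. A larger `spread` hides individual records better but makes the aggregate estimates noisier.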
This paper introduces the privacy concerns and the obvious conflict between
privacy and data mining. Two approaches to resolving this conflict are then
introduced: the randomization approach and the cryptographic approach.
We consider the case of performing data mining classification on randomized
data. Two algorithms based on Bayes' rule, Step-Class and Global-Decision, are
proposed for classifying randomized data with high accuracy compared to
classification algorithms for non-perturbed data.
The two algorithms are tested experimentally to measure their classification
accuracy. Our empirical results show that the Step-Class algorithm achieves a
higher classification accuracy ratio than the Global-Decision algorithm.

Keywords