In this paper, healthcare data is collected, consisting of details of patients' symptoms, diseases, and related attributes. After collection, the data is pre-processed, since only filtered data is needed for analysis. The data is stored in Hadoop, and a user can retrieve it by symptom, disease, and other criteria.
Big Data is a collection of large and complex data sets comprising structured, semi-structured, and unstructured data. Data is generated from many sources and many fields, and in today's era it is produced in enormous volumes as the whole world moves toward digitalization: social media sites, digital pictures and videos, and much more. All such data is known as big data. Data mining is a technique for extracting patterns from large-scale data sets; with its help, useful and meaningful information can be extracted from big data by processing it.
Table of Contents
1. INTRODUCTION
1.1 IDEA AND MOTIVATION
1.2 LITERATURE SURVEY
2. PROBLEM DEFINITION AND SCOPE
2.1 SCOPE
2.2 SOFTWARE CONTEXT
2.3 SOFTWARE CONSTRAINTS
2.4 OUTCOMES
2.5 HARDWARE SPECIFICATION
2.6 S/W SPECIFICATION
2.7 AREA OF DISSERTATION
3. DISSERTATION PLAN
3.1 PROJECT PLAN
3.2 TIMELINE OF PROJECT
3.3 FEASIBILITY STUDY
3.3.1 Economical Feasibility
3.3.2 Technical Feasibility
3.3.3 Operational Feasibility
3.3.4 Time Feasibility
3.4 RISK MANAGEMENT
3.4.1 Project Risk:
3.4.2 Risk Assessment
3.5 EFFORT AND COST ESTIMATION
3.5.1 Lines of code (LOC)
3.5.2 Effort
3.5.3 Development Time
3.5.4 Number of People
4. SOFTWARE REQUIREMENT SPECIFICATION
4.1 INTRODUCTION
4.1.1 Purpose
4.1.2 Scope of Document
4.1.3 Overview of responsibilities of developer
4.2 PRODUCT OVERVIEW
4.2.1 Block diagram
4.3 FUNCTIONAL MODEL
4.3.1 Flow diagram
4.3.2 Data Flow Diagram
4.3.3 UML Diagrams
4.3.3.1 Sequence diagram
4.3.3.2 Class diagram
4.3.4 Non-Functional Requirements
4.4 BEHAVIORAL MODEL AND DESCRIPTION
4.4.1 Description of software behavior
4.4.2 Use case diagram
5. DETAILED DESIGN
5.1 ARCHITECTURE DESIGN
5.1.1 Algorithms:
5.2 INTERFACES
5.2.1 Human Interface
5.2.2 Database Interface
6. TESTING
6.1 INTRODUCTION
6.1.1 Goals and Objective
6.2 TESTING STRATEGY
6.2.1 White Box Testing
6.2.2 Black Box Testing
6.2.3 System testing
6.2.4 Performance testing
7. DATA TABLE AND DISCUSSION
7.1 INPUT TO THE SYSTEM
7.2 OUTPUT:
7.3 PERFORMANCE OF PROPOSED SYSTEM
7.3.1 Performance of proposed system with respect to baseline algorithm:
7.3.2 Performance of proposed system with respect to blowfish encryption algorithm:
7.4 RESULT
7.4.1 Difference between proposed algorithm and base algorithm, i.e. provider-aware algorithm:
8. SUMMARY AND CONCLUSION
8.1 FUTURE ENHANCEMENT
Objectives & Core Topics
The primary objective of this work is to develop an effective data mining and anonymization method to protect sensitive information during collaborative data publishing in distributed network environments, aiming to achieve higher performance compared to existing encryption-based approaches.
- Data Anonymization using Slicing techniques
- L-diversity and privacy constraint verification
- Collaborative data publishing in distributed databases
- Comparative performance analysis of algorithms (Slicing vs. Encryption)
- Prevention of "insider attacks" that exploit background knowledge
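A central privacy constraint listed above is l-diversity: each bucket of published records must contain at least l distinct values in the sensitive-attribute column, so an attacker who links a person to a bucket cannot infer the disease. The following is a minimal sketch of a distinct l-diversity check; the record fields (`age`, `zip`, `disease`) and the sample data are hypothetical, chosen to mirror the patient-data scenario in the paper.

```python
def is_l_diverse(bucket, l):
    """Distinct l-diversity check: the sensitive-attribute column
    ('disease' here) must contain at least l distinct values."""
    sensitive_values = {record["disease"] for record in bucket}
    return len(sensitive_values) >= l

# Hypothetical bucket of patient records
bucket = [
    {"age": 34, "zip": "41101", "disease": "flu"},
    {"age": 36, "zip": "41102", "disease": "asthma"},
    {"age": 39, "zip": "41103", "disease": "flu"},
]
print(is_l_diverse(bucket, 2))  # True: two distinct diseases
print(is_l_diverse(bucket, 3))  # False: only {"flu", "asthma"}
```

A bucket that fails the check must be merged with another bucket or suppressed before the anonymized view is published.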
Excerpt from the Book
5.1.1 Algorithms:
Slicing is based on attribute and tuple partitioning. In attribute partitioning (vertical partitioning), the data is divided into column groups such as {Name}, {Age, Zip}, and {Disease}; in tuple partitioning (horizontal partitioning), rows are grouped into buckets such as {t1, t2, t3, t4, t5, t6}. Age and Zip are partitioned together because they are highly correlated quasi-identifiers (QI), which may be known to an attacker. During tuple partitioning, the system must check l-diversity on the sensitive-attribute (SA) column. The algorithm runs as follows.
1. Initialize: bucket size k = n; i = row count; column count = C; Q = {D}   // D = data in the database; ArrayList a[i]
2. While Q is not empty:
       if i <= n:
           check l-diversity
       else:
           i++
       return D*
3. Q = Q − {D* + a[i]}
4. Repeat steps 2 and 3 with the next tuple in Q
5. D* = D* ∪ A[D]   // next anonymized view of data D
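The steps above can be sketched concretely. The following is a toy illustration of slicing, not the authors' implementation: it vertically partitions rows into the column groups named in the text ({Name}, {Age, Zip}, {Disease}), horizontally partitions the tuples into buckets, checks distinct l-diversity per bucket, and randomly permutes each column group within a bucket to break the linkage between quasi-identifiers and the sensitive attribute. The field names, bucket size, and sample rows are assumptions for illustration.

```python
import random

def slice_table(rows, attribute_groups, bucket_size, l):
    """Toy slicing sketch: attribute (vertical) partition, tuple
    (horizontal) partition, per-bucket l-diversity check, and an
    independent random permutation of each column group per bucket.
    Returns the anonymized view D*."""
    buckets = [rows[i:i + bucket_size] for i in range(0, len(rows), bucket_size)]
    anonymized = []
    for bucket in buckets:
        # Distinct l-diversity on the sensitive attribute ('disease')
        if len({r["disease"] for r in bucket}) < l:
            raise ValueError("bucket violates l-diversity")
        # Permute each attribute group independently within the bucket
        columns = []
        for group in attribute_groups:
            values = [tuple(r[c] for c in group) for r in bucket]
            random.shuffle(values)
            columns.append(values)
        # Re-assemble sliced rows: each output row may mix values
        # from different original tuples
        anonymized.extend(list(zip(*columns)))
    return anonymized

# Hypothetical patient rows
rows = [
    {"name": "A", "age": 30, "zip": "41101", "disease": "flu"},
    {"name": "B", "age": 32, "zip": "41102", "disease": "asthma"},
    {"name": "C", "age": 35, "zip": "41103", "disease": "flu"},
    {"name": "D", "age": 37, "zip": "41104", "disease": "cold"},
]
view = slice_table(rows, [("name",), ("age", "zip"), ("disease",)],
                   bucket_size=2, l=2)
```

Within each bucket the multiset of values in every column is preserved, so aggregate queries over the sliced view remain accurate while per-row linkage is broken.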
Summary of Chapters
1. INTRODUCTION: Discusses the necessity of privacy-preserving data sharing in distributed environments and introduces the motivation behind using slicing techniques over traditional encryption models.
2. PROBLEM DEFINITION AND SCOPE: Defines the goal of creating an anonymized, attack-immune data view and outlines the software context, constraints, and hardware/software environment.
3. DISSERTATION PLAN: Provides a comprehensive project timeline, feasibility study (economical, technical, operational, and time), and detailed risk assessment and cost estimation metrics.
4. SOFTWARE REQUIREMENT SPECIFICATION: Details the functional and behavioral models, including block diagrams, data flow diagrams, and UML diagrams for the proposed system.
5. DETAILED DESIGN: Explains the architectural design and the specific slicing and L-diversity algorithms implemented to maintain data privacy.
6. TESTING: Describes the testing strategies employed, including white box, black box, and performance testing, to validate system reliability and computation efficiency.
7. DATA TABLE AND DISCUSSION: Evaluates the system results by comparing the computation time and performance complexity of the proposed slicing algorithm versus baseline and blowfish encryption models.
8. SUMMARY AND CONCLUSION: Summarizes the research findings regarding privacy protection in collaborative data publishing and suggests potential future improvements, such as ad hoc grid computing implementation.
Keywords
Data Mining, Slicing, Anonymization, L-diversity, Distributed Database, Privacy Preserving, Blowfish Encryption, Collaborative Data Publishing, Insider Attack, Quasi Identifiers, Sensitive Attribute, Security, Software Requirement Specification, Data Utility, Performance Metrics
Frequently Asked Questions
What is the core focus of this research?
The research focuses on enhancing data privacy during collaborative data publishing across distributed databases by utilizing slicing techniques as an alternative to computationally expensive encryption methods.
What are the primary thematic areas covered?
Key areas include data anonymization, attribute and tuple partitioning, performance measurement of algorithms, the L-diversity model, and security assurance in distributed networks.
What is the ultimate goal of the proposed system?
The aim is to provide a privacy-preserved, anonymized view of integrated data from different providers that is immune to attacks while maintaining optimal computation time.
Which scientific methodology is utilized?
The work employs a comparative methodology, using algorithmic analysis and performance evaluation (computation time and complexity) to contrast the proposed Slicing method with existing encryption and provider-aware algorithms.
What does the main body of the work address?
The main body focuses on the design of the system, the development of the Slicing and L-diversity algorithms, and the rigorous testing (black/white box) of these implementations against various threat scenarios.
Which keywords best characterize this work?
The work is best characterized by terms such as Data Mining, Slicing, L-diversity, Distributed Database, Anonymization, and Privacy Preserving.
Does the system address specific types of attacks?
Yes, the system is designed to detect and protect against "insider attacks" where entities may attempt to breach privacy using background knowledge.
Why is the proposed slicing method preferred over standard encryption?
The proposed method offers a significant reduction in computation time, making it more efficient for systems like hospital patient data management or banking where rapid access is required alongside privacy.
How is the system validated?
Validation is performed through a multi-stage testing process including white-box and black-box testing, alongside graphical performance comparisons illustrating CPU usage and execution time.
- Cite this work
- Dnyandeo Khemnar (Author), Nilesh Thorat (Author), 2017, Effective Data Mining Techniques for Unstructured Data in Big Data, Munich, GRIN Verlag, https://www.grin.com/document/1307474