The Hadoop security system suffers from a number of security issues. In its early days, developers focused mostly on the development of basic functionality, and the design of security components was not of prime interest. As a consequence, the technology remained vulnerable to malicious activities of unauthorized users whose aim is to disrupt system functionality or to compromise private user data. Researchers and developers are continuously trying to resolve these issues by upgrading Hadoop's security mechanisms and preventing undesirable malicious activities. In this paper, the most common HDFS security problems and a review of unauthorized access issues are presented.
First, the Hadoop mechanism and its main components are described as an introduction to the main research problem. Then, the HDFS architecture is presented, and all of its components and functionalities are introduced. Further, all possible types of users are listed, with an emphasis on unauthorized users, who are of particular importance for this paper. One part of the research is dedicated to the consideration of Hadoop security levels and to environment and user assessments.
The review also includes an explanation of the Log Monitoring and Audit features, and a detailed consideration of authorization and authentication issues. Possible consequences of unauthorized access to a system are covered, and a few recommendations for solving the problem of unauthorized access are offered. Honeypot nodes, security mechanisms for collecting valuable information about malicious parties, are presented in the last part of the paper. Finally, the idea of developing a new type of intrusion detector based on an artificial neural network is presented. The detector will be an integral part of a new kind of virtual honeypot mechanism and represents the initial basis for the authors' future scientific work.
Hadoop is a master-slave open-source platform for storing, managing and distributing data across a large number of servers [7]. It is a Java-based solution to the majority of Big Data issues and is distributed under the Apache License. It is a highly available technology that operates on large volumes of data and can be used for high-speed distribution and processing of information. Hadoop efficiently addresses the "3V" challenge (volume, velocity, variety) by providing a framework for horizontal scaling over large data sets, for handling high data transfer velocities, and for efficiently processing a variety of unstructured data. It can also handle the failure of a single machine by re-executing all of its tasks. However, in a large-scale system such as Hadoop, the occurrence of failures is unavoidable.
At a basic level, Hadoop is built from two main components [8]: MapReduce and the Hadoop Distributed File System (HDFS). The MapReduce component provides the computational side of Hadoop in the form of distributed data processing. It organizes multiple processors in a cluster to perform the required calculations: it distributes the computation tasks among the machines and assembles the final results in one place. Additionally, this component handles network failures so that they do not disturb or abort active computation processes. On the other side, HDFS is used for information management and distributed storage of data. It is the file system component that provides reliable and scalable storage and global file access. The HDFS component is of main interest in this paper, so it is explained further in the next two subsections.
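To make the division of work concrete, the following sketch shows a minimal MapReduce word-count job written against the standard Hadoop Java API: the map phase emits (word, 1) pairs from each input split, and the reduce phase sums the counts per word. The class name and the input/output paths are illustrative and not part of the reviewed system.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}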
1.2. HDFS Architecture
The main goals of HDFS are storing large amounts of data in clusters and providing high throughput of information within a system. Data is stored in the form of equally sized blocks, where the typical size of each block is 64 MB or 128 MB. Depending on its size, each file is stored in one or more blocks. The block size is configurable, and each file can have only one writer at a time. Within the HDFS component, a client can create new directories; create, save or delete files; rename files and change their paths; etc.
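As an illustration of these client-side operations, the following sketch uses the Hadoop FileSystem Java API to create a directory, write a file, rename it and delete it. The NameNode address and the file paths are assumed values used only for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      // Create a new directory.
      fs.mkdirs(new Path("/user/alice/reports"));

      // Create and write a file; HDFS allows a single writer per file.
      try (FSDataOutputStream out = fs.create(new Path("/user/alice/reports/day1.txt"))) {
        out.writeUTF("sample record");
      }

      // Rename (move) the file to a different path.
      fs.rename(new Path("/user/alice/reports/day1.txt"),
                new Path("/user/alice/archive/day1.txt"));

      // Delete the file (non-recursive).
      fs.delete(new Path("/user/alice/archive/day1.txt"), false);
    }
  }
}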
The HDFS architecture is based on the master-slave principle, and it is built from a single NameNode and a group of DataNodes [9]. The NameNode (the master node), as the core part of the system, manages the HDFS directory tree and stores all metadata of the file system. Clients communicate directly with the NameNode to perform standard file operations. Further, the NameNode maps file names to the blocks stored on the DataNodes. Another of its functions is monitoring for the possible failure of a DataNode and resolving such a failure by creating block replicas [10]. The NameNode can take two additional roles in the system: it can act as a CheckpointNode or a BackupNode. A periodical checkpoint is an excellent way to protect the system metadata. The BackupNode, on the other hand, maintains a file system image that is synchronized with the NameNode state. It handles potential failures and rolls back or restarts using the last good checkpoint.
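For illustration, the following sketch lists the standard configuration keys that govern where the NameNode persists its metadata and how often a namespace checkpoint is taken. In a real deployment these values are typically set in hdfs-site.xml rather than in code, and the directory paths shown are assumptions.

import org.apache.hadoop.conf.Configuration;

public class NameNodeCheckpointConfig {
  public static Configuration build() {
    Configuration conf = new Configuration();

    // Local directory where the NameNode persists the file system image and edit log.
    conf.set("dfs.namenode.name.dir", "/data/hdfs/namenode");          // assumed path

    // Directory used by the checkpointing node to store merged checkpoints.
    conf.set("dfs.namenode.checkpoint.dir", "/data/hdfs/checkpoint");  // assumed path

    // Create a new checkpoint of the namespace every hour (value in seconds).
    conf.setLong("dfs.namenode.checkpoint.period", 3600);

    return conf;
  }
}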
Additionally, in enterprise versions of Hadoop, it is common practice to introduce a Secondary NameNode. It is a useful addition in case the original NameNode crashes; in that case, the Secondary NameNode uses the saved HDFS checkpoint to restart the crashed NameNode. DataNodes store all file blocks and perform the tasks delegated by the NameNode. Each file on a DataNode can be split into several blocks, each labelled with an identification timestamp. These nodes provide the service of writing and reading the desired files. By default, each data block is replicated three times: two copies are stored on two different DataNodes within a single rack, and a third copy is stored on a DataNode belonging to another rack.
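As a brief illustration of this replication scheme, the following sketch shows how the default block size and replication factor can be configured and how the replication of an individual file can be changed through the client API. The property values and the file path are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
    conf.setInt("dfs.replication", 3);                 // default: three replicas per block

    try (FileSystem fs = FileSystem.get(conf)) {
      // Raise the replication factor of one important file to four copies.
      fs.setReplication(new Path("/user/alice/important.dat"), (short) 4);
    }
  }
}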