, , , , , , , , , , ,

Hadoop vs RDBMS:

RDBMS and Hadoop are different concepts of storing, processing and retrieving  the information. DBMS and RDBMS are in the literature for a long time whereas  Hadoop is a new concept comparatively. As the storage capacities and customer  data size are increased enormously, processing this information with in a  reasonable amount of time becomes crucial. Especially when it comes to data  warehousing applications, business intelligence reporting, and various  analytical processing, it becomes very challenging to perform complex reporting  within a reasonable amount of time as the size of the data grows exponentially  as well as the growing demands of customers for complex analysis and reporting.


What is Hadoop?

Hadoop logo

Hadoop is an open source Apache project. Hadoop framework was written in  Java. It is scalable and therefore can support high performance demanding  applications. Storing very large amounts of data on the file systems of multiple  computers are possible in Hadoop framework.  It is configured to enable  scalability from single node or computer to thousands of nodes or independent  systems in such a way that the individual nodes use local computer storage, CPU,  memory and processing power. Error handling is performed in the application  layer level when a node is failed, and therefore, dynamic addition of nodes,  i.e., processing power, in an as needed basis by ensuring the high-availability,  eg: without a need for a downtime on production environment, of an individual  node.

Hadoop framework was developed based on Google’s MapReduce algorithm. The  term BIG data in an organization is the huge amount of information or data that  is unable to be processed by using traditional methods within reasonable amount  of time. The problem was identified by Internet search  companies that had to query very large amount of unorganized and distributed  data. Big-Data processing becomes very highly demanded practice in these days  and therefore, Hadoop becomes very popular especially for the companies which  process BIG data.  Facebook , AOL , IBM , ImageShack and Yahoo are some of the  companies that have been using Hadoop. Recently, there are hundreds of companies  started working on BIG data processing applications based on Hadoop  framework.


What is RDBMS?

RDBMS is relational database management system. Database management system  (DBMS) stores data in the form of tables, which comprises of columns and rows.  The structured query language (SQL) will be used to extract necessary data  stored in these tables. The RDBMS which stores the relationships between these  tables in different forms such as one column entries of a table will serve as a  reference for another table. These column values are known as primary keys and  foreign keys. These keys will be used to reference the other tables so that the  appropriate data can be related and be retrieved by joining these different  tables using SQL queries as needed. The tables and the relationships can be  manipulated by joining appropriate tables through SQL queries.

The most important attribute of a relational database system is that a single  database system generally has several tables and relationships between these  tables so that the information is classified into tables of independent  entities. They are also stored independently in a normalized or simplified way  and a relationship is maintained within these tables using primary/foreign key  constraints.  This is different from a flat file or data structure. The data on  a database could be stored in a single data file or multiple data files. The  data file size will grow or the new data files will be added as the new records  are added and the size of the database is increased. These all files are  commonly shared by the database server.  In high availability systems, these  data files are shared so that each node will have access to the same data file.   Generally all popular database systems are relational database management  systems. In order to give some quick and easy navigation to related data, some  logical views are created from the actual tables.  There will be a physical  existence for every table in the database whereas a view is a virtual table,  which does not exist physically rather a logical creation from the existing  physical table. IBM DB2, Microsoft SQL Server, Sybase, Oracle, MySQL and  PostgreSQL are some examples for RDBMS.


What is the difference between Hadoop and an RDBMS?

Hadoop framework works very well with structured and unstructured data. This  also supports variety of data formats in real time such as XML, JSON and text  based flat file formats. However, RDBMS only work with better when an entity  relationship model (ER model) is defined perfectly and therefore, the database  schema or structure can grow and unmanaged otherwise. i.e., An RDBMS works well  with structured data. Hadoop will be a choice in environments such as when there  are needs for BIG data processing on which the data being processed does not  have consistent relationships. Where the data size is too BIG for complex  processing, or not easy to define the relationships between the data, then it  becomes difficult to save the extracted information in an RDBMS with a coherent  relationship.

For example, to analyze Internet data published by various websites. Out of  those existing hundreds of millions of websites, each website has different  types of contents and the relationships between them are not unique. In such  cases, Hadoop is a great choice. Since the exposure of these capabilities  increase, the companies choosing Hadoop not only for help handling the  historically grown BIG data, but also using Hadoop for meeting high performance  needs for new applications.  For eg: Plotting a monthly energy usage of a  customer by comparing between previous months, between his or her neighbors or  even between customers on the same streets. This will bring more awareness, but  running such complex comparison by analyzing large set of data takes several  hours of processing time, and introduction of Hadoop help improving the  computing performance from 10 times to 100 times or more.

RDBMS database technology is a very proven, consistent, matured and highly  supported by world best companies. This works better when the data is  definitions such as data types, relationships among the data, constraints and  etc. Hence, this is more appropriate for real time OLTP processing.



  • RDBMS is relational database management system. Hadoop is node based flat  structure.
  • RDMS is generally used for OLTP processing whereas Hadoop is currently used  for analytical and especially for BIG DATA processing.
  • Any maintenance on storage, or data files, a downtime is needed for any  available RDBMS.  In standalone database systems, to add processing power such  as more CPU, physical memory in non-virtualized environment, a downtime is  needed for RDBMS such as DB2, Oracle, and SQL Server. However, Hadoop systems  are individual independent nodes that can be added in an as needed basis.
  • The database cluster uses the same data files stored in shared storage in  RDBMS systems, whereas the storage data can be stored independently in each  processing node.
  • The performance tuning of an RDBMS can go nightmare. Even in proven  environment. However, Hadoop enables hot tuning by adding extra nodes which will  be self-managed.

This post also helps answering the following questions: What is the  difference between a Hadoop database and a traditional Relational Database? What is the difference between a Hadoop database and a database management  system (DBMS)?

Read more http://www.wikidifference.com/difference-between-hadoop-and-rdbms/

Ref.. http://www.wikidifference.com/difference-between-hadoop-and-rdbms/