Sunday, 26 April 2009

DATABASE NORMALIZATION

WHAT IS NORMALIZATION?
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.


DATABASE NORMALIZATION
In the field of relational database design, normalization is a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristics (insertion, update, and deletion anomalies) that could lead to a loss of data integrity. E.F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the first normal form in 1970. Codd went on to define the second and third normal forms in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd normal form in 1974. Higher normal forms were defined by other theorists in subsequent years, the most recent being the sixth normal form introduced by Chris Date, Hugh Darwen, and Nikos Lorentzos in 2002.
Informally, a relational database table (the computerized representation of a relation) is often described as "normalized" if it is in the third normal form (3NF). Most 3NF tables are free of insertion, update, and deletion anomalies, i.e. in most cases 3NF tables adhere to BCNF, 4NF, and 5NF (but typically not 6NF).
A standard piece of database design guidance is that the designer should begin by fully normalizing the design, and selectively denormalize only in places where doing so is absolutely necessary to address performance issues. However, some modeling disciplines, such as the dimensional modeling approach to data warehouse design, explicitly recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF.


DATABASE DESIGN PROCESS

• Gather user/business requirements.
• Develop an E-R model based on those requirements.
• Convert the E-R model to a set of relations (tables).
• Normalize the relations to remove anomalies.
• Implement the database by creating a table for each normalized relation.


DATABASE NORMALIZATION

• Normalization is the process of shaping the database structure so that most ambiguity is removed.
• Normalization proceeds in stages, from the least strict (1NF) to the most strict (5NF).
• Normalization is usually carried out only to 3NF or BCNF, because that is already sufficient to produce tables of good quality.
• Why normalize?
• To optimize table structures, which:

a. Increases speed
b. Eliminates duplicate entry of the same data
c. Uses storage media more efficiently
d. Reduces redundancy
e. Avoids anomalies (insertion, deletion, and update anomalies)
f. Improves data integrity

• A table is said to be good (efficient), or normal, if it meets the following three criteria:

a. If the table is decomposed, the decomposition must be guaranteed safe (Lossless-Join Decomposition). That is, after the table is decomposed into new tables, joining those new tables back together must reproduce the original table exactly (see the sketch after this list).

b. The functional dependencies among the data must be preserved when the data changes (dependency preservation).

c. It must not violate Boyce-Codd Normal Form (BCNF).

• If the third criterion (BCNF) cannot be met, then at a minimum the table should not violate third normal form (3NF).
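
To make the lossless-join criterion concrete, here is a minimal Python sketch, assuming an invented three-column relation (EmployeeID, Department, DeptLocation): it decomposes the relation into two projections and verifies that their natural join reproduces the original rows exactly.

def project(rows, onto):
    """Project a relation (list of dict rows) onto a subset of its
    columns, removing duplicate rows, as relational projection does."""
    return {tuple(sorted((c, r[c]) for c in onto)) for r in rows}

def natural_join(rel_a, rel_b):
    """Natural join of two relations stored as sets of sorted
    (column, value) tuples: rows combine when they agree on all
    columns the two relations share."""
    joined = set()
    for a in rel_a:
        for b in rel_b:
            da, db = dict(a), dict(b)
            if all(da[c] == db[c] for c in da.keys() & db.keys()):
                joined.add(tuple(sorted({**da, **db}.items())))
    return joined

# Invented relation: DeptLocation depends on Department.
employees = [
    {"EmployeeID": 1, "Department": "Sales", "DeptLocation": "London"},
    {"EmployeeID": 2, "Department": "Sales", "DeptLocation": "London"},
    {"EmployeeID": 3, "Department": "IT",    "DeptLocation": "Oslo"},
]
original = project(employees, ["EmployeeID", "Department", "DeptLocation"])

# Decompose on the shared attribute Department; because
# Department -> DeptLocation holds, the decomposition is lossless.
r1 = project(employees, ["EmployeeID", "Department"])
r2 = project(employees, ["Department", "DeptLocation"])
print("Lossless:", natural_join(r1, r2) == original)   # True

The decomposition above is safe because the shared attribute Department functionally determines DeptLocation; splitting on an attribute with no such dependency can produce spurious rows when the projections are rejoined.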

OBJECTIVES OF NORMALIZATION
A basic objective of the first normal form defined by Codd in 1970 was to permit data to be queried and manipulated using a "universal data sub-language" grounded in first-order logic.[9] (SQL is an example of such a data sub-language, albeit one that Codd regarded as seriously flawed.)[10] Querying and manipulating the data within an unnormalized data structure, such as the following non-1NF representation of customers' credit card transactions, involves more complexity than is really necessary:

[Table omitted in the original post: each Customer row holds a nested, repeating group of Transactions (Tr. ID, Date, Amount).]

To each customer there corresponds a repeating group of transactions. The automated evaluation of any query relating to customers' transactions therefore would broadly involve two stages:
1. Unpacking one or more customers' groups of transactions allowing the individual transactions in a group to be examined, and
2. Deriving a query result based on the results of the first stage
For example, in order to find out the monetary sum of all transactions that occurred in October 2003 for all customers, the system would have to know that it must first unpack the Transactions group of each customer, then sum the Amounts of all transactions thus obtained where the Date of the transaction falls in October 2003.
One of Codd's important insights was that this structural complexity could always be removed completely, leading to much greater power and flexibility in the way queries could be formulated (by users and applications) and evaluated (by the DBMS). The normalized equivalent of the structure above would look like this:

[Table omitted in the original post: a normalized Transactions table with one row per individual transaction (Customer, Tr. ID, Date, Amount).]

Now each row represents an individual credit card transaction, and the DBMS can obtain the answer of interest simply by finding all rows with a Date falling in October 2003 and summing their Amounts. All of the values in the data structure are on an equal footing: they are all exposed to the DBMS directly and can directly participate in queries, whereas in the previous situation some values were embedded in lower-level structures that had to be handled specially. Accordingly, the normalized design lends itself to general-purpose query processing, whereas the unnormalized design does not.
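
As an illustration of the difference (a sketch only; the customer names, dates, and amounts below are invented, though the columns follow the example above), the unnormalized design forces the application to unpack each customer's nested transaction group by hand, while the normalized design reduces the same October 2003 question to a single SQL aggregate, shown here with Python's sqlite3 module:

import sqlite3

# Unnormalized: each customer row carries a nested, repeating group of
# transactions, so the application must unpack the groups itself.
customers = [
    {"customer": "Jones",
     "transactions": [("Tr-1", "2003-10-05", -87), ("Tr-2", "2003-11-02", 12)]},
    {"customer": "Wilkins",
     "transactions": [("Tr-3", "2003-10-14", -21)]},
]

# Stage 1: unpack the groups; stage 2: filter and sum.
total = sum(amount
            for row in customers
            for (_tid, date, amount) in row["transactions"]
            if date.startswith("2003-10"))
print("October 2003 total (manual unpacking):", total)

# Normalized: one row per transaction, so the DBMS can answer directly.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Transactions "
           "(Customer TEXT, TrID TEXT, Date TEXT, Amount INTEGER)")
db.executemany("INSERT INTO Transactions VALUES (?, ?, ?, ?)",
               [(row["customer"], tid, date, amount)
                for row in customers
                for (tid, date, amount) in row["transactions"]])
(total_sql,) = db.execute(
    "SELECT SUM(Amount) FROM Transactions "
    "WHERE Date LIKE '2003-10%'").fetchone()
print("October 2003 total (SQL over normalized table):", total_sql)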


BACKGROUND TO NORMALIZATION: DEFINITIONS

Functional dependency: Attribute B has a functional dependency on attribute A (i.e., A → B) if, for each value of attribute A, there is exactly one value of attribute B. If value of A is repeating in tuples then value of B will also repeat. In our example, Employee Address has a functional dependency on Employee ID, because a particular Employee ID value corresponds to one and only one Employee Address value. (Note that the reverse need not be true: several employees could live at the same address and therefore one Employee Address value could correspond to more than one Employee ID. Employee ID is therefore not functionally dependent on Employee Address.) An attribute may be functionally dependent either on a single attribute or on a combination of attributes. It is not possible to determine the extent to which a design is normalized without understanding what functional dependencies apply to the attributes within its tables; understanding this, in turn, requires knowledge of the problem domain. For example, an Employer may require certain employees to split their time between two locations, such as New York City and London, and therefore want to allow Employees to have more than one Employee Address. In this case, Employee Address would no longer be functionally dependent on Employee ID.
Another way to look at the above is by reviewing basic mathematical functions:
Let F(x) be a mathematical function of one independent variable. The independent variable is analogous to attribute A, and the dependent variable (the dependent attribute, in the terminology above) is the value F(A); this is the origin of the term functional dependency. As we know, a mathematical function can have only one output for each input. Notationally, this relationship is commonly written B = F(A), which corresponds to the dependency A → B.
There are also functions of more than one independent variable; these are commonly called multivariable functions. They represent an attribute being functionally dependent on a combination of attributes: F(x,y,z) has three independent variables, or independent attributes, and one dependent value. A multivariable function, too, can have only one output, i.e. one dependent variable or attribute.
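
The definition can be turned into a mechanical check. Below is a minimal Python sketch (the employee rows are invented) that tests whether A → B holds in a set of rows: the dependency holds exactly when no value of A is ever paired with two different values of B.

def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are tuples of column names.
    The dependency holds when each lhs value maps to exactly one rhs value.
    """
    seen = {}
    for r in rows:
        key = tuple(r[c] for c in lhs)
        val = tuple(r[c] for c in rhs)
        if seen.setdefault(key, val) != val:
            return False   # same lhs value, two different rhs values
    return True

# Invented sample data for the Employee example above.
rows = [
    {"EmployeeID": 1, "EmployeeAddress": "12 Oak St"},
    {"EmployeeID": 2, "EmployeeAddress": "12 Oak St"},  # shared address is fine
    {"EmployeeID": 3, "EmployeeAddress": "9 Elm Ave"},
]
print(holds(rows, ("EmployeeID",), ("EmployeeAddress",)))  # True
print(holds(rows, ("EmployeeAddress",), ("EmployeeID",)))  # False: one address, two IDs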

Trivial functional dependency
A trivial functional dependency is a functional dependency of an attribute on a superset of itself. {Employee ID, Employee Address} → {Employee Address} is trivial, as is {Employee Address} → {Employee Address}.

Full functional dependency
An attribute is fully functionally dependent on a set of attributes X if it is
• functionally dependent on X, and
• not functionally dependent on any proper subset of X. {Employee Address} has a functional dependency on {Employee ID, Skill}, but not a full functional dependency, because it is also dependent on {Employee ID}.

Transitive dependency
A transitive dependency is an indirect functional dependency, one in which X→Z only by virtue of X→Y and Y→Z.

Multivalued dependency
A multivalued dependency is a constraint according to which the presence of certain rows in a table implies the presence of certain other rows.

Join dependency
A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T.

Superkey
A superkey is an attribute or set of attributes that uniquely identifies rows within a table; in other words, two distinct rows are always guaranteed to have distinct values for the superkey's attributes. {Employee ID, Employee Address, Skill} would be a superkey for the "Employees' Skills" table; {Employee ID, Skill} would also be a superkey.

Candidate key
A candidate key is a minimal superkey, that is, a superkey for which no proper subset is also a superkey. {Employee ID, Skill} would be a candidate key for the "Employees' Skills" table.

Non-prime attribute
A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would be a non-prime attribute in the "Employees' Skills" table.

Primary key
Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible unique keys. A primary key is a key which the database designer has designated for this purpose.
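
The definitions of superkey and candidate key above lend themselves to a brute-force check. The sketch below (invented rows; it enumerates all column subsets, so it is exponential in the number of columns and suitable only for tiny examples) finds every superkey of a relation and keeps the minimal ones as candidate keys.

from itertools import combinations

def is_superkey(rows, cols):
    """cols is a superkey if no two rows share the same values for cols."""
    projected = [tuple(r[c] for c in cols) for r in rows]
    return len(projected) == len(set(projected))

def candidate_keys(rows):
    """All minimal superkeys, by brute force over column subsets."""
    columns = list(rows[0])
    superkeys = [set(c)
                 for n in range(1, len(columns) + 1)
                 for c in combinations(columns, n)
                 if is_superkey(rows, c)]
    # A candidate key is a superkey with no proper subset that is
    # also a superkey.
    return [k for k in superkeys
            if not any(other < k for other in superkeys)]

# Invented rows for the "Employees' Skills" example above.
rows = [
    {"EmployeeID": 1, "Skill": "Typing", "EmployeeAddress": "12 Oak St"},
    {"EmployeeID": 1, "Skill": "Filing", "EmployeeAddress": "12 Oak St"},
    {"EmployeeID": 2, "Skill": "Typing", "EmployeeAddress": "12 Oak St"},
]
print(candidate_keys(rows))   # [{'EmployeeID', 'Skill'}] for this data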

Normal forms
The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF.
The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables are in normal form n.
Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms.

DENORMALIZATION
Databases intended for Online Transaction Processing (OLTP) are typically more normalized than databases intended for Online Analytical Processing (OLAP). OLTP applications are characterized by a high volume of small transactions, such as updating a sales record at a supermarket checkout counter. The expectation is that each transaction will leave the database in a consistent state. By contrast, databases intended for OLAP operations are primarily "read mostly" databases. OLAP applications tend to extract historical data that has accumulated over a long period of time. For such databases, redundant or "denormalized" data may facilitate business intelligence applications. Specifically, dimensional tables in a star schema often contain denormalized data. The denormalized or redundant data must be carefully controlled during ETL processing, and users should not be permitted to see the data until it is in a consistent state. The normalized alternative to the star schema is the snowflake schema. It has never been proven whether the denormalization itself provides the increase in performance, or whether the concurrent removal of data constraints is what does. In many cases, the need for denormalization has waned as computers and RDBMS software have become more powerful, but since data volumes have generally increased along with hardware and software performance, OLAP databases often still use denormalized schemas.
Denormalization is also used to improve performance on smaller computers, such as computerized cash registers and mobile devices, since these may use the data for look-up only (e.g. price lookups). Denormalization may also be used when no RDBMS exists for a platform (such as Palm), or when no changes are to be made to the data and a swift response is crucial.

Non-first normal form (NF² or N1NF)
In recognition that denormalization can be deliberate and useful, the non-first normal form is a definition of database designs which do not conform to the first normal form, by allowing "sets and sets of sets to be attribute domains" (Schek 1982). This extension is a (non-optimal) way of implementing hierarchies in relations. Some academics have dubbed this practitioner-developed method "First Ab-normal Form"; Codd defined a relational database as using relations, so any table not in 1NF could not be considered relational.

Consider the following table:

Non-First Normal Form

[Table omitted in the original post: each Person row holds a set-valued Favorite Colors attribute.]

Assume a person has several favorite colors: the Favorite Colors attribute then consists of a set of colors, as modeled by the table above.
To transform this NF² table into 1NF, an "unnest" operator is required, which extends the relational algebra of the higher normal forms (it would allow "colors" to become a table of its own). The reverse operator is called "nest"; nest is not always the mathematical inverse of "unnest", although "unnest" is the mathematical inverse of "nest". A further required constraint is that the operators be bijective, which is covered by the Partitioned Normal Form (PNF). A rough sketch of the two operators follows.
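
Here is a rough Python sketch of the two operators (person names and colors invented); in this example, where each person appears in exactly one NF² row, "nest" does invert "unnest":

from itertools import groupby

# NF² representation: the FavoriteColors attribute holds a *set* of colors.
nf2 = [
    {"Person": "Bob",  "FavoriteColors": {"blue", "red"}},
    {"Person": "Jane", "FavoriteColors": {"green"}},
]

def unnest(rows, col):
    """Flatten a set-valued column into one 1NF row per element."""
    return [{**{k: v for k, v in r.items() if k != col}, col: v}
            for r in rows for v in r[col]]

def nest(rows, col):
    """Regroup a flat column back into sets; inverts unnest when each
    entity appears in exactly one NF² row, as it does here."""
    keyfunc = lambda r: sorted((k, v) for k, v in r.items() if k != col)
    nested = []
    for key, group in groupby(sorted(rows, key=keyfunc), key=keyfunc):
        nested.append({**dict(key), col: {g[col] for g in group}})
    return nested

flat = unnest(nf2, "FavoriteColors")   # one (Person, color) row per color
print(flat)
print(nest(flat, "FavoriteColors") == nf2)   # True: nest inverted unnest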

First Normal Form (1NF)
First normal form (1NF) sets the very basic rules for an organized database:
• Eliminate duplicative columns from the same table.
• Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

Second Normal Form (2NF)
Second normal form (2NF) further addresses the concept of removing duplicative data:
• Meet all the requirements of the first normal form.
• Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
• Create relationships between these new tables and their predecessors through the use of foreign keys (see the sketch below).
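
As a hedged sketch of such a split (table and column names are illustrative, reusing the employee example from the definitions above), suppose the key of a skills table is (EmployeeID, Skill) but EmployeeAddress depends on EmployeeID alone; 2NF moves the address into its own table, linked back by a foreign key. The example uses Python's sqlite3 module:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")

# 2NF violation (for contrast): with key (EmployeeID, Skill),
# EmployeeAddress would depend on only *part* of the key:
#   EmployeesSkills(EmployeeID, Skill, EmployeeAddress)

# 2NF fix: the partially dependent column gets its own table, and the
# tables are related through a foreign key.
db.executescript("""
CREATE TABLE Employees (
    EmployeeID      INTEGER PRIMARY KEY,
    EmployeeAddress TEXT
);
CREATE TABLE EmployeeSkills (
    EmployeeID INTEGER REFERENCES Employees(EmployeeID),
    Skill      TEXT,
    PRIMARY KEY (EmployeeID, Skill)
);
""")
db.execute("INSERT INTO Employees VALUES (1, '12 Oak St')")
db.executemany("INSERT INTO EmployeeSkills VALUES (?, ?)",
               [(1, "Typing"), (1, "Filing")])

# The address is now stored once per employee, not once per skill.
for row in db.execute("""SELECT s.EmployeeID, s.Skill, e.EmployeeAddress
                         FROM EmployeeSkills s
                         JOIN Employees e USING (EmployeeID)"""):
    print(row)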


Third Normal Form (3NF)

Third normal form (3NF) goes one large step further:
• Meet all the requirements of the second normal form.
• Remove columns that are not dependent upon the primary key (see the sketch below).
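
A minimal sketch of a 3NF fix, with invented names: DeptLocation depends on Department, which is not the key of an employees table, so keeping it there would create a transitive dependency on EmployeeID; moving it to a Departments table makes a location change a single-row update.

import sqlite3

db = sqlite3.connect(":memory:")

# 3NF violation (for contrast): DeptLocation depends on Department,
# a non-key column, so it depends on the key only transitively:
#   Employees(EmployeeID, Department, DeptLocation)

# 3NF fix: the transitively dependent column moves to its own table.
db.executescript("""
CREATE TABLE Departments (
    Department   TEXT PRIMARY KEY,
    DeptLocation TEXT
);
CREATE TABLE Employees (
    EmployeeID INTEGER PRIMARY KEY,
    Department TEXT REFERENCES Departments(Department)
);
""")
db.execute("INSERT INTO Departments VALUES ('Sales', 'London')")
db.executemany("INSERT INTO Employees VALUES (?, ?)",
               [(1, "Sales"), (2, "Sales")])

# Relocating a department is now a single-row update, with no risk of
# two employees showing different locations for the same department.
db.execute("UPDATE Departments SET DeptLocation = 'Oslo' "
           "WHERE Department = 'Sales'")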

Fourth Normal Form (4NF)
Finally, fourth normal form (4NF) has one additional requirement:
• Meet all the requirements of the third normal form.
• A relation is in 4NF if it has no multi-valued dependencies (see the sketch below).
Remember, these normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.
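
To see why multi-valued dependencies matter, here is a small Python sketch (all data invented): skills and languages are independent facts about an employee, and a single pre-4NF table is forced to record their full cross product, whereas the 4NF design keeps each fact in its own table.

# Invented data: two independent multi-valued facts about employee E1.
skills    = {("E1", "Typing"), ("E1", "Filing")}
languages = {("E1", "English"), ("E1", "French")}

# Pre-4NF single table: every skill must be paired with every language,
# otherwise the table implies facts that were never asserted.
combined = {(e, s, l)
            for (e, s) in skills
            for (e2, l) in languages if e == e2}
print(len(combined))   # 4 rows to record 2 + 2 independent facts

# The 4NF design keeps skills and languages in two separate tables;
# the cross product above is derivable by a join whenever it is needed.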




REFERENCES

http://en.wikipedia.org/wiki/Database_normalization
http://databases.about.com/od/specificproducts/a/normalization.htm
