3 posts tagged 'data'

  1. 2017.02.01 [Apache Mahout] GenericDataModel example code
  2. 2014.03.07 11.6 Data Type Storage Requirements
  3. 2014.01.14 [elasticsearch] node master/data ....

[Apache Mahout] GenericDataModel example code

ITWeb/General Development 2017.02.01 11:52

The DataModel implementations in Apache Mahout are included in the following project and package.


- mahout-mr 


- org.apache.mahout.cf.taste.impl.model.*


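// Map from user ID to that user's preferences.
FastByIDMap<PreferenceArray> result = new FastByIDMap<PreferenceArray>();
List<Preference> prefsList = Lists.newArrayList();
// The value parameter is a float, so the literal needs the f suffix.
prefsList.add(new GenericPreference(1645390L, 123456L, 0.4f));
result.put(1645390L, new GenericUserPreferenceArray(prefsList));

// Wrap the map in a GenericDataModel and hand it to the recommender.
return new ExampleRecommender(new GenericDataModel(result));

// Constructor signature, for reference:
public GenericPreference(long userID, long itemID, float value)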

The code is simple enough that I will leave it at that.

Creative Commons License: Attribution, NonCommercial, NoDerivatives

11.6 Data Type Storage Requirements

ITWeb/General Development 2014.03.07 17:27



The storage requirements for data vary according to the storage engine being used for the table in question. Different storage engines use different methods for recording the raw data and different data types. In addition, some engines may compress the information in a given row, either on a column or entire row basis, which makes it difficult to calculate the storage requirements for a given table or column structure.

However, all storage engines must communicate and exchange information on a given row within a table using the same structure, and this information is consistent, irrespective of the storage engine used to write the information to disk.

This section includes some guidelines and information for the storage requirements for each data type supported by MySQL, including details for the internal format and the sizes used by storage engines that use a fixed-size representation for different types. Information is listed by category or storage engine.

The internal representation of a table has a maximum row size of 65,535 bytes, even if the storage engine is capable of supporting larger rows. This figure excludes BLOB or TEXT columns, which contribute only 9 to 12 bytes toward this size. For BLOB and TEXT data, the information is stored internally in a different area of memory than the row buffer. Different storage engines handle the allocation and storage of this data in different ways, according to the method they use for handling the corresponding types. For more information, see Chapter 14, Storage Engines, and Section E.7.4, “Limits on Table Column Count and Row Size”.

Storage Requirements for InnoDB Tables

See “Physical Row Structure” for information about storage requirements for InnoDB tables.

Storage Requirements for NDBCLUSTER Tables


For tables using the NDBCLUSTER storage engine, there is the factor of 4-byte alignment to be taken into account when calculating storage requirements. This means that all NDB data storage is done in multiples of 4 bytes. Thus, a column value that would take 15 bytes in a table using a storage engine other than NDB requires 16 bytes in an NDB table. This requirement applies in addition to any other considerations that are discussed in this section. For example, in NDBCLUSTER tables, the TINYINT, SMALLINT, MEDIUMINT, and INTEGER (INT) column types each require 4 bytes storage per record due to the alignment factor.
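The alignment rule above amounts to rounding each column's byte size up to the next multiple of 4. A minimal sketch, where `NdbAlignment` and `alignTo4` are hypothetical names chosen for illustration and not part of MySQL:

```java
// Sketch only: rounds a byte count up to NDB's 4-byte alignment.
final class NdbAlignment {
    // Hypothetical helper: smallest multiple of 4 that is >= bytes.
    static int alignTo4(int bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        return (bytes + 3) / 4 * 4;
    }

    public static void main(String[] args) {
        System.out.println(alignTo4(15)); // the 15-byte example from the text
        System.out.println(alignTo4(1));  // a 1-byte TINYINT value
    }
}
```

With these inputs, the 15-byte value comes out as 16 bytes and the 1-byte value as 4 bytes, matching the figures in the paragraph above.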

An exception to this rule is the BIT type, which is not 4-byte aligned. In MySQL Cluster tables, a BIT(M) column takes M bits of storage space. However, if a table definition contains 1 or more BIT columns (up to 32 BIT columns), then NDBCLUSTER reserves 4 bytes (32 bits) per row for these. If a table definition contains more than 32 BIT columns (up to 64 such columns), then NDBCLUSTER reserves 8 bytes (that is, 64 bits) per row.

In addition, while a NULL itself does not require any storage space, NDBCLUSTER reserves 4 bytes per row if the table definition contains any columns defined as NULL, up to 32 NULL columns. (If a MySQL Cluster table is defined with more than 32 NULL columns up to 64 NULL columns, then 8 bytes per row is reserved.)
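The per-row reservation described for BIT columns and for NULL columns follows the same pattern, so it can be sketched once. The class and method names below are hypothetical, and the sketch assumes that a table with no such columns reserves nothing and that counts above 64 are out of scope, since the text does not cover them:

```java
// Sketch only: bytes NDBCLUSTER reserves per row for BIT columns
// (or, by the same rule, for nullable columns), as described above.
final class NdbReservedBytes {
    // count: number of BIT columns (or nullable columns) in the table.
    static int reservedBytesPerRow(int count) {
        if (count < 0 || count > 64) {
            // The text only describes the range up to 64 columns.
            throw new IllegalArgumentException("count must be between 0 and 64");
        }
        if (count == 0) {
            return 0; // assumption: nothing is reserved when no such column exists
        }
        return count <= 32 ? 4 : 8;
    }

    public static void main(String[] args) {
        System.out.println(reservedBytesPerRow(1));
        System.out.println(reservedBytesPerRow(33));
    }
}
```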

When calculating storage requirements for MySQL Cluster tables, you must also remember that every table using the NDBCLUSTER storage engine requires a primary key; if no primary key is defined by the user, then a hidden primary key will be created by NDB. This hidden primary key consumes 31-35 bytes per table record.

You may find the ndb_size.pl utility to be useful for estimating NDB storage requirements. This Perl script connects to a current MySQL (non-Cluster) database and creates a report on how much space that database would require if it used the NDBCLUSTER storage engine. See Section 17.4.23, “ndb_size.pl — NDBCLUSTER Size Requirement Estimator”, for more information.

Storage Requirements for Numeric Types

Data Type | Storage Required
BIGINT | 8 bytes
FLOAT(p) | 4 bytes if 0 <= p <= 24, 8 bytes if 25 <= p <= 53
FLOAT | 4 bytes
DECIMAL(M,D), NUMERIC(M,D) | Varies; see following discussion
BIT(M) | approximately (M+7)/8 bytes
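The FLOAT(p) and BIT(M) rows can be read as small functions. This is a rough sketch with hypothetical names; it assumes that (M+7)/8 is meant as integer division, which is one reading of "approximately":

```java
// Sketch only: storage for FLOAT(p) and BIT(M) per the table above.
final class NumericStorage {
    // 4 bytes for precision 0..24, 8 bytes for precision 25..53.
    static int floatBytes(int p) {
        if (p < 0 || p > 53) {
            throw new IllegalArgumentException("p must be between 0 and 53");
        }
        return p <= 24 ? 4 : 8;
    }

    // Assumption: (M+7)/8 with integer division, i.e. M bits rounded up to whole bytes.
    static int bitBytes(int m) {
        if (m < 1 || m > 64) {
            throw new IllegalArgumentException("M must be between 1 and 64");
        }
        return (m + 7) / 8;
    }

    public static void main(String[] args) {
        System.out.println(floatBytes(24) + " " + floatBytes(25));
        System.out.println(bitBytes(1) + " " + bitBytes(9) + " " + bitBytes(64));
    }
}
```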

Values for DECIMAL (and NUMERIC) columns are represented using a binary format that packs nine decimal (base 10) digits into four bytes. Storage for the integer and fractional parts of each value is determined separately. Each multiple of nine digits requires four bytes, and the leftover digits require some fraction of four bytes. The storage required for excess digits is given by the following table.

Leftover Digits | Number of Bytes
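The DECIMAL calculation can be sketched as follows. The names are hypothetical, and the leftover-digit byte counts (0 digits: 0 bytes, 1 to 2: 1, 3 to 4: 2, 5 to 6: 3, 7 to 9: 4) are taken from the MySQL reference manual as an assumption, since they are not reproduced in this post:

```java
// Sketch only: bytes needed for a DECIMAL(M,D) value, with the integer
// and fractional parts sized separately as described above.
final class DecimalStorage {
    // Index = leftover digits (0..8). Values assumed from the MySQL manual.
    private static final int[] LEFTOVER_BYTES = {0, 1, 1, 2, 2, 3, 3, 4, 4};

    // Bytes for one part (integer or fractional) with the given digit count.
    static int partBytes(int digits) {
        if (digits < 0) {
            throw new IllegalArgumentException("digits must be non-negative");
        }
        return digits / 9 * 4 + LEFTOVER_BYTES[digits % 9];
    }

    // m = total digits, d = digits after the decimal point.
    static int decimalBytes(int m, int d) {
        if (d < 0 || d > m) {
            throw new IllegalArgumentException("require 0 <= D <= M");
        }
        return partBytes(m - d) + partBytes(d);
    }

    public static void main(String[] args) {
        System.out.println(decimalBytes(18, 9)); // nine digits on each side
        System.out.println(decimalBytes(20, 6)); // 14 integer digits, 6 fractional
    }
}
```

Under these assumptions DECIMAL(18,9) needs 4 + 4 = 8 bytes, and DECIMAL(20,6) needs 7 bytes for the integer part plus 3 for the fraction.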

Storage Requirements for Date and Time Types

Data Type | Storage Required
DATE | 3 bytes
TIME | 3 bytes
YEAR | 1 byte

For details about internal representation of temporal values, see MySQL Internals: Important Algorithms and Structures.

Storage Requirements for String Types

In the following table, M represents the declared column length in characters for nonbinary string types and bytes for binary string types. L represents the actual length in bytes of a given string value.

Data Type | Storage Required
CHAR(M) | M × w bytes, 0 <= M <= 255, where w is the number of bytes required for the maximum-length character in the character set. See “Physical Row Structure” for information about CHAR data type storage requirements for InnoDB tables.
BINARY(M) | M bytes, 0 <= M <= 255
VARCHAR(M), VARBINARY(M) | L + 1 bytes if column values require 0 – 255 bytes, L + 2 bytes if values may require more than 255 bytes
TINYBLOB, TINYTEXT | L + 1 bytes, where L < 2^8
BLOB, TEXT | L + 2 bytes, where L < 2^16
MEDIUMBLOB, MEDIUMTEXT | L + 3 bytes, where L < 2^24
LONGBLOB, LONGTEXT | L + 4 bytes, where L < 2^32
ENUM('value1','value2',...) | 1 or 2 bytes, depending on the number of enumeration values (65,535 values maximum)
SET('value1','value2',...) | 1, 2, 3, 4, or 8 bytes, depending on the number of set members (64 members maximum)

Variable-length string types are stored using a length prefix plus data. The length prefix requires from one to four bytes depending on the data type, and the value of the prefix is L (the byte length of the string). For example, storage for a MEDIUMTEXT value requires L bytes to store the value plus three bytes to store the length of the value.
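The length-prefix rule for the TEXT family can be sketched like this. The enum and method names are hypothetical; the prefix widths and the limit of 2 raised to (8 × prefix bytes) follow the string-type table above:

```java
// Sketch only: stored size of a TEXT-family value = L + length-prefix bytes.
final class BlobStorage {
    enum TextType {
        TINYTEXT(1), TEXT(2), MEDIUMTEXT(3), LONGTEXT(4);

        final int prefixBytes;

        TextType(int prefixBytes) {
            this.prefixBytes = prefixBytes;
        }
    }

    // valueLength is L, the byte length of the value being stored.
    static long storedBytes(TextType type, long valueLength) {
        long limit = 1L << (8 * type.prefixBytes); // L must be below this
        if (valueLength < 0 || valueLength >= limit) {
            throw new IllegalArgumentException("value does not fit in " + type);
        }
        return valueLength + type.prefixBytes;
    }

    public static void main(String[] args) {
        // A 1000-byte MEDIUMTEXT value: 1000 bytes of data plus a 3-byte prefix.
        System.out.println(storedBytes(TextType.MEDIUMTEXT, 1000));
    }
}
```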

To calculate the number of bytes used to store a particular CHAR, VARCHAR, or TEXT column value, you must take into account the character set used for that column and whether the value contains multi-byte characters. In particular, when using the utf8 Unicode character set, you must keep in mind that not all characters use the same number of bytes and can require up to three bytes per character. For a breakdown of the storage used for different categories of utf8 characters, see Section 10.1.10, “Unicode Support”.

VARCHAR, VARBINARY, and the BLOB and TEXT types are variable-length types. For each, the storage requirements depend on these factors:

  • The actual length of the column value

  • The column's maximum possible length

  • The character set used for the column, because some character sets contain multi-byte characters

For example, a VARCHAR(255) column can hold a string with a maximum length of 255 characters. Assuming that the column uses the latin1 character set (one byte per character), the actual storage required is the length of the string (L), plus one byte to record the length of the string. For the string 'abcd', L is 4 and the storage requirement is five bytes. If the same column is instead declared to use the ucs2 double-byte character set, the storage requirement is 10 bytes: The length of 'abcd' is eight bytes and the column requires two bytes to store lengths because the maximum length is greater than 255 (up to 510 bytes).
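The 'abcd' example can be reproduced with a short sketch. The names are hypothetical, and Java's ISO_8859_1 and UTF_16BE charsets stand in for MySQL's latin1 and ucs2 as an approximation that holds for plain ASCII text like this:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch only: VARCHAR storage = L + 1 or L + 2, depending on whether
// the column's maximum byte length can exceed 255.
final class VarcharStorage {
    static int storedBytes(String value, int declaredChars, int maxBytesPerChar, Charset charset) {
        int actualLength = value.getBytes(charset).length; // L in the text
        int maxColumnBytes = declaredChars * maxBytesPerChar;
        int lengthPrefix = maxColumnBytes > 255 ? 2 : 1;
        return actualLength + lengthPrefix;
    }

    public static void main(String[] args) {
        // VARCHAR(255) with a one-byte character set.
        System.out.println(storedBytes("abcd", 255, 1, StandardCharsets.ISO_8859_1));
        // VARCHAR(255) with a two-byte character set (up to 510 bytes).
        System.out.println(storedBytes("abcd", 255, 2, StandardCharsets.UTF_16BE));
    }
}
```

The first call yields 5 bytes and the second 10 bytes, the same figures as in the paragraph above.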

The effective maximum number of bytes that can be stored in a VARCHAR or VARBINARY column is subject to the maximum row size of 65,535 bytes, which is shared among all columns. For a VARCHAR column that stores multi-byte characters, the effective maximum number of characters is less. For example, utf8 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8 character set can be declared to be a maximum of 21,844 characters. See Section E.7.4, “Limits on Table Column Count and Row Size”.
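The 21,844 figure can be checked with simple arithmetic. This sketch uses hypothetical names and assumes a table with a single NOT NULL VARCHAR column, a 2-byte length prefix, and no other per-row overhead:

```java
// Sketch only: largest declarable VARCHAR length under the 65,535-byte row limit.
final class VarcharLimit {
    static final int MAX_ROW_BYTES = 65_535;

    static int maxDeclaredChars(int maxBytesPerChar) {
        if (maxBytesPerChar < 1) {
            throw new IllegalArgumentException("maxBytesPerChar must be at least 1");
        }
        int lengthPrefix = 2; // assumption: values may exceed 255 bytes
        return (MAX_ROW_BYTES - lengthPrefix) / maxBytesPerChar;
    }

    public static void main(String[] args) {
        // utf8 with up to three bytes per character.
        System.out.println(maxDeclaredChars(3));
    }
}
```

For three bytes per character this works out to (65,535 - 2) / 3, rounded down, which is 21,844.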

The NDBCLUSTER storage engine in MySQL 5.1 supports variable-width columns. This means that a VARCHAR column in a MySQL Cluster table requires the same amount of storage as it would using any other storage engine, with the exception that such values are 4-byte aligned. Thus, the string 'abcd' stored in a VARCHAR(50) column using the latin1 character set requires 8 bytes (rather than 5 bytes for the same column value in a MyISAM table: four bytes of data plus a one-byte length prefix). This represents a change in behavior from earlier versions of NDBCLUSTER, where a VARCHAR(50) column would require 52 bytes storage per record regardless of the length of the string being stored.

TEXT and BLOB columns are implemented differently in the NDB Cluster storage engine, wherein each row in a TEXT column is made up of two separate parts. One of these is of fixed size (256 bytes), and is actually stored in the original table. The other consists of any data in excess of 256 bytes, which is stored in a hidden table. The rows in this second table are always 2,000 bytes long. This means that the size of a TEXT column is 256 if size <= 256 (where size represents the size of the row); otherwise, the size is 256 + size + (2000 – (size – 256) % 2000).
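The size rule for NDB TEXT columns can be written out directly. This sketch transcribes the formula exactly as stated in the paragraph above; the class and method names are hypothetical:

```java
// Sketch only: size of a TEXT column value in NDB, per the formula above.
final class NdbTextSize {
    static final long INLINE_BYTES = 256;     // fixed part stored in the original table
    static final long HIDDEN_ROW_BYTES = 2000; // row length in the hidden table

    static long textColumnBytes(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (size <= INLINE_BYTES) {
            return INLINE_BYTES;
        }
        // 256 + size + (2000 - (size - 256) % 2000), as given in the text.
        return INLINE_BYTES + size
                + (HIDDEN_ROW_BYTES - (size - INLINE_BYTES) % HIDDEN_ROW_BYTES);
    }

    public static void main(String[] args) {
        System.out.println(textColumnBytes(100));  // fits in the fixed part
        System.out.println(textColumnBytes(1000)); // spills into the hidden table
    }
}
```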

The size of an ENUM object is determined by the number of different enumeration values. One byte is used for enumerations with up to 255 possible values. Two bytes are used for enumerations having between 256 and 65,535 possible values. See Section 11.4.4, “The ENUM Type”.

The size of a SET object is determined by the number of different set members. If the set size is N, the object occupies (N+7)/8 bytes, rounded up to 1, 2, 3, 4, or 8 bytes. A SET can have a maximum of 64 members. See Section 11.4.5, “The SET Type”.
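Both rules are easy to express as functions. A minimal sketch with hypothetical names, following the ENUM and SET descriptions above:

```java
// Sketch only: storage for ENUM and SET columns per the rules above.
final class EnumSetStorage {
    // 1 byte for up to 255 values, 2 bytes for 256..65,535 values.
    static int enumBytes(int values) {
        if (values < 1 || values > 65_535) {
            throw new IllegalArgumentException("an ENUM has 1 to 65,535 values");
        }
        return values <= 255 ? 1 : 2;
    }

    // (N+7)/8 bytes, rounded up to 1, 2, 3, 4, or 8.
    static int setBytes(int members) {
        if (members < 1 || members > 64) {
            throw new IllegalArgumentException("a SET has 1 to 64 members");
        }
        int raw = (members + 7) / 8; // between 1 and 8
        return raw <= 4 ? raw : 8;
    }

    public static void main(String[] args) {
        System.out.println(enumBytes(255) + " " + enumBytes(256));
        System.out.println(setBytes(8) + " " + setBytes(24) + " " + setBytes(33));
    }
}
```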

tags : data, MySQL, size, type

[elasticsearch] node master/data ....

Elastic/Elasticsearch 2014.01.14 17:34

Reference: http://stackoverflow.com/questions/15019821/what-differents-between-master-node-gateway-and-other-node-gateway-in-elasticsea

Posting this here for reference.


The master node is the same as any other node in the cluster, except that it has been elected to be the master.

It is responsible for coordinating any cluster-wide changes, such as the addition or removal of a node; the creation, deletion, or change of state (i.e. open/close) of an index; and the allocation of shards to nodes. When any of these changes occur, the "cluster state" is updated by the master and published to all other nodes in the cluster. It is the only node that may publish a new cluster state.

The tasks that a master performs are lightweight. Any tasks that deal with data (eg indexing, searching etc) do not need to involve the master. If you choose to run the master as a non-data node (ie a node that acts as master and as a router, but doesn't contain any data) then the master can run happily on a smallish box.

A node is allowed to become a master if it is marked as "master eligible" (which all nodes are by default). If the current master goes down, a new master will be elected by the cluster.

An important configuration option in your cluster is minimum_master_nodes. This specifies the number of "master eligible" nodes that a node must be able to see in order to be part of a cluster. Its purpose is to avoid "split brain" ie having the cluster separate into two clusters, both of which think that they are functioning correctly.

For instance, if you have 3 nodes, all of which are master eligible, and set minimum_master_nodes to 1, then if the third node is separated from the other two, it still sees one master-eligible node (itself) and thinks that it can form a cluster by itself.

Instead, set minimum_master_nodes to 2 in this case (number of nodes / 2 + 1), then if the third node separates, it won't see enough master nodes, and thus won't form a cluster by itself. It will keep trying to join the original cluster.
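The quorum arithmetic and the split-brain scenario can be sketched as follows. This is an illustration only, with hypothetical names; it is not Elasticsearch code, and it models just the counting rule from the text:

```java
// Sketch only: the (number of nodes / 2 + 1) rule and the visibility
// check that keeps an isolated node from forming its own cluster.
final class MasterQuorum {
    // Recommended minimum_master_nodes for the given number of master-eligible nodes.
    static int minimumMasterNodes(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1; // integer division
    }

    // A node may form or join a cluster only if it sees enough master-eligible nodes.
    static boolean canFormCluster(int visibleMasterEligible, int minimumMasterNodes) {
        return visibleMasterEligible >= minimumMasterNodes;
    }

    public static void main(String[] args) {
        int quorum = minimumMasterNodes(3);
        System.out.println(quorum);
        // An isolated node sees only itself.
        System.out.println(canFormCluster(1, 1));      // setting of 1: split brain possible
        System.out.println(canFormCluster(1, quorum)); // setting of 2: isolated node waits
    }
}
```

With 3 master-eligible nodes the rule gives 2, so the isolated node (which sees only 1) cannot form a cluster, while the two nodes that still see each other can.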

While Elasticsearch tries very hard to choose the correct defaults, minimum_master_nodes is impossible to guess, as it has no way of knowing how many nodes you intend to run. This is something you must configure yourself.

