The Atlas Type System

Published: 2024-01-08 13:41:20 | Author: 粒子先生

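Predefined system types

Atlas ships with a few predefined system types. An earlier section showed one example (DataSet). This section looks at more of these types and why they matter.

  • Referenceable: this type represents every entity that can be searched for by a unique attribute named qualifiedName.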
{
            "category": "ENTITY",
            "guid": "0e8a1c75-53ee-41d1-9bb7-8de5d79c8f24",
            "createdBy": "hadoop",
            "updatedBy": "hadoop",
            "createTime": 1615972678324,
            "updateTime": 1615972753274,
            "version": 4,
            "name": "Referenceable",
            "description": "Referenceable",
            "typeVersion": "1.3",
            "serviceType": "atlas_core",
            "attributeDefs": [
                {
                    "name": "qualifiedName",
                    "typeName": "string",
                    "isOptional": false,
                    "cardinality": "SINGLE",
                    "valuesMinCount": 1,
                    "valuesMaxCount": 1,
                    "isUnique": true,
                    "isIndexable": true,
                    "includeInNotification": false,
                    "searchWeight": 10
                },
                {
                    "name": "replicatedFrom",
                    "typeName": "array<AtlasServer>",
                    "isOptional": true,
                    "cardinality": "SET",
                    "valuesMinCount": 0,
                    "valuesMaxCount": 2147483647,
                    "isUnique": false,
                    "isIndexable": false,
                    "includeInNotification": false,
                    "searchWeight": -1,
                    "options": {
                        "isSoftReference": "true"
                    }
                },
                {
                    "name": "replicatedTo",
                    "typeName": "array<AtlasServer>",
                    "isOptional": true,
                    "cardinality": "SET",
                    "valuesMinCount": 0,
                    "valuesMaxCount": 2147483647,
                    "isUnique": false,
                    "isIndexable": false,
                    "includeInNotification": false,
                    "searchWeight": -1,
                    "options": {
                        "isSoftReference": "true"
                    }
                }
            ],
            "superTypes": [],
            "subTypes": [
                "spark_storagedesc",
                "hive_storagedesc",
                "Asset",
                "ddl"
            ],
            "relationshipAttributeDefs": [
                {
                    "name": "meanings",
                    "typeName": "array<AtlasGlossaryTerm>",
                    "isOptional": true,
                    "cardinality": "SET",
                    "valuesMinCount": -1,
                    "valuesMaxCount": -1,
                    "isUnique": false,
                    "isIndexable": false,
                    "includeInNotification": false,
                    "searchWeight": -1,
                    "relationshipTypeName": "AtlasGlossarySemanticAssignment",
                    "isLegacyAttribute": false
                }
            ],
            "businessAttributeDefs": {}
        }

 

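  • Asset: this type extends Referenceable and adds attributes such as name, description, and owner. name is a required attribute (isOptional = false); the others are optional. The purpose of Referenceable and Asset is to give modelers a way to enforce consistency when defining and querying entities of their own types. Having this fixed set of attributes lets applications and user interfaces make convention-based assumptions about which attributes they can expect a type to have by default.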
{
            "category": "ENTITY",
            "guid": "6a515760-42e3-458f-be69-e2e9da99b2e8",
            "createdBy": "hadoop",
            "updatedBy": "hadoop",
            "createTime": 1615972679056,
            "updateTime": 1615972754341,
            "version": 6,
            "name": "Asset",
            "description": "Asset",
            "typeVersion": "1.6",
            "serviceType": "atlas_core",
            "attributeDefs": [
                {
                    "name": "name",
                    "typeName": "string",
                    "isOptional": false,
                    "cardinality": "SINGLE",
                    "valuesMinCount": 1,
                    "valuesMaxCount": 1,
                    "isUnique": false,
                    "isIndexable": true,
                    "includeInNotification": false,
                    "searchWeight": 10,
                    "indexType": "STRING"
                },
                {
                    "name": "description",
                    "typeName": "string",
                    "isOptional": true,
                    "cardinality": "SINGLE",
                    "valuesMinCount": 0,
                    "valuesMaxCount": 1,
                    "isUnique": false,
                    "isIndexable": false,
                    "includeInNotification": false,
                    "searchWeight": 9
                },
                {
                    "name": "owner",
                    "typeName": "string",
                    "isOptional": true,
                    "cardinality": "SINGLE",
                    "valuesMinCount": 0,
                    "valuesMaxCount": 1,
                    "isUnique": false,
                    "isIndexable": true,
                    "includeInNotification": false,
                    "searchWeight": 9,
                    "indexType": "STRING"
                },
                {
                    "name": "displayName",
                    "typeName": "string",
                    "isOptional": true,
                    "cardinality": "SINGLE",
                    "valuesMinCount": 0,
                    "valuesMaxCount": 1,
                    "isUnique": false,
                    "isIndexable": true,
                    "includeInNotification": false,
                    "searchWeight": -1,
                    "indexType": "STRING"
                },
                {
                    "name": "userDescription",
                    "typeName": "string",
                    "isOptional": true,
                    "cardinality": "SINGLE",
                    "valuesMinCount": 0,
                    "valuesMaxCount": 1,
                    "isUnique": false,
                    "isIndexable": true,
                    "includeInNotification": false,
                    "searchWeight": -1,
                    "indexType": "STRING"
                }
            ],
            "superTypes": [
                "Referenceable"
            ],
            "subTypes": [
                "DataSet",
                "ProcessExecution",
                "adls_gen2_account",
                "Infrastructure",
                "Process",
                "hive_db",
                "hbase_namespace"
            ],
            "relationshipAttributeDefs": [
                {
                    "name": "meanings",
                    "typeName": "array<AtlasGlossaryTerm>",
                    "isOptional": true,
                    "cardinality": "SET",
                    "valuesMinCount": -1,
                    "valuesMaxCount": -1,
                    "isUnique": false,
                    "isIndexable": false,
                    "includeInNotification": false,
                    "searchWeight": -1,
                    "relationshipTypeName": "AtlasGlossarySemanticAssignment",
                    "isLegacyAttribute": false
                }
            ],
            "businessAttributeDefs": {}
        }

 

  • Infrastructure: this type extends Asset and typically serves as a common supertype for infrastructure metadata objects such as clusters and hosts.
{
            "category": "ENTITY",
            "guid": "d0ab486f-3e35-4618-a183-419e5e74ac70",
            "createdBy": "hadoop",
            "updatedBy": "hadoop",
            "createTime": 1615972679137,
            "updateTime": 1615972749472,
            "version": 2,
            "name": "Infrastructure",
            "description": "Infrastructure can be IT infrastructure, which contains hosts and servers. Infrastructure might not be IT orientated, such as 'Car' for IoT applications.",
            "typeVersion": "1.2",
            "serviceType": "atlas_core",
            "attributeDefs": [],
            "superTypes": [
                "Asset"
            ],
            "subTypes": [
                "falcon_cluster"
            ],
            "relationshipAttributeDefs": [
                {
                    "name": "meanings",
                    "typeName": "array<AtlasGlossaryTerm>",
                    "isOptional": true,
                    "cardinality": "SET",
                    "valuesMinCount": -1,
                    "valuesMaxCount": -1,
                    "isUnique": false,
                    "isIndexable": false,
                    "includeInNotification": false,
                    "searchWeight": -1,
                    "relationshipTypeName": "AtlasGlossarySemanticAssignment",
                    "isLegacyAttribute": false
                }
            ],
            "businessAttributeDefs": {}
        }
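  • DataSet: this type extends Asset (see superTypes in the definition that follows). Conceptually, it represents any type that stores data. In Atlas, Hive tables, HBase tables, and similar types all extend DataSet. Types that extend DataSet can be expected to have a schema, in the sense that they carry an attribute that defines the attributes of the dataset, for example the columns attribute of hive_table. In addition, entities of types that extend DataSet take part in data transformations, and Atlas can capture those transformations through a lineage graph.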
{
    "category": "ENTITY",
    "guid": "53230ff2-a233-4f3d-8af6-9a595e64dc75",
    "createdBy": "hadoop",
    "updatedBy": "hadoop",
    "createTime": 1615972679128,
    "updateTime": 1615973070757,
    "version": 3,
    "name": "DataSet",
    "description": "DataSet",
    "typeVersion": "1.2",
    "serviceType": "atlas_core",
    "attributeDefs": [],
    "superTypes": [
        "Asset"
    ],
    "subTypes": [
        "adls_gen2_container",
        "rdbms_foreign_key",
        "StorageDesc",
        "spark_ml_directory",
        "ozone_volume",
        "hive_table",
        "spark_column",
        "aws_s3_pseudo_dir",
        "sqoop_dbdatastore",
        "hbase_column",
        "rdbms_instance",
        "spark_table",
        "falcon_feed",
        "jms_topic",
        "hbase_table",
        "Column",
        "rdbms_table",
        "rdbms_column",
        "hbase_column_family",
        "hive_column",
        "rdbms_db",
        "Table",
        "ml_model_deployment",
        "spark_ml_pipeline",
        "ozone_bucket",
        "kafka_topic",
        "adls_gen2_blob",
        "View",
        "spark_ml_model",
        "aws_s3_v2_base",
        "adls_gen2_directory",
        "rdbms_index",
        "ml_project",
        "ozone_key",
        "aws_s3_bucket",
        "aws_s3_object",
        "avro_type",
        "ml_model_build",
        "DB",
        "fs_path",
        "spark_db"
    ],
    "relationshipAttributeDefs": [
        {
            "name": "inputToProcesses",
            "typeName": "array<Process>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": -1,
            "valuesMaxCount": -1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "dataset_process_inputs",
            "isLegacyAttribute": false
        },
        {
            "name": "pipeline",
            "typeName": "spark_ml_pipeline",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": -1,
            "valuesMaxCount": -1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "spark_ml_pipeline_dataset",
            "isLegacyAttribute": false
        },
        {
            "name": "schema",
            "typeName": "array<avro_schema>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": -1,
            "valuesMaxCount": -1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "avro_schema_associatedEntities",
            "isLegacyAttribute": false
        },
        {
            "name": "model",
            "typeName": "spark_ml_model",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": -1,
            "valuesMaxCount": -1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "spark_ml_model_dataset",
            "isLegacyAttribute": false
        },
        {
            "name": "meanings",
            "typeName": "array<AtlasGlossaryTerm>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": -1,
            "valuesMaxCount": -1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "AtlasGlossarySemanticAssignment",
            "isLegacyAttribute": false
        },
        {
            "name": "outputFromProcesses",
            "typeName": "array<Process>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": -1,
            "valuesMaxCount": -1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "process_dataset_outputs",
            "isLegacyAttribute": false
        }
    ],
    "businessAttributeDefs": {}
}
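  • Process: this type extends Asset. Conceptually, it represents any data transformation operation. For example, an ETL process that transforms a Hive table holding raw data into another Hive table storing some aggregate can be a specific type that extends Process. Process has two attributes of its own, inputs and outputs, both of which are arrays of DataSet entities. An instance of a Process type can therefore use these inputs and outputs to capture how the lineage of a DataSet evolves.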
{
    "category": "ENTITY",
    "guid": "f77d14ad-ba0d-4e37-91c5-b657c5e3321b",
    "createdBy": "hadoop",
    "updatedBy": "hadoop",
    "createTime": 1615972679838,
    "updateTime": 1615972749654,
    "version": 2,
    "name": "Process",
    "description": "Process",
    "typeVersion": "1.2",
    "serviceType": "atlas_core",
    "attributeDefs": [
        {
            "name": "inputs",
            "typeName": "array<DataSet>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1
        },
        {
            "name": "outputs",
            "typeName": "array<DataSet>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1
        }
    ],
    "superTypes": [
        "Asset"
    ],
    "subTypes": [
        "falcon_feed_replication",
        "falcon_process",
        "spark_column_lineage",
        "falcon_feed_creation",
        "ml_model_train_build_process",
        "spark_process",
        "ml_model_deploy_process",
        "ml_project_create_process",
        "hive_process",
        "LoadProcess",
        "impala_process",
        "impala_column_lineage",
        "spark_application",
        "sqoop_process",
        "hive_column_lineage",
        "storm_topology"
    ],
    "relationshipAttributeDefs": [
        {
            "name": "outputs",
            "typeName": "array<DataSet>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "process_dataset_outputs",
            "isLegacyAttribute": true
        },
        {
            "name": "inputs",
            "typeName": "array<DataSet>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "dataset_process_inputs",
            "isLegacyAttribute": true
        },
        {
            "name": "meanings",
            "typeName": "array<AtlasGlossaryTerm>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": -1,
            "valuesMaxCount": -1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "searchWeight": -1,
            "relationshipTypeName": "AtlasGlossarySemanticAssignment",
            "isLegacyAttribute": false
        }
    ],
    "businessAttributeDefs": {}
}
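
The definitions above form a small inheritance chain (Referenceable, then Asset, then DataSet or Process). As a rough sketch of what "inheriting attributes through superTypes" means, the snippet below walks that chain in plain Python. The attribute names and superTypes are copied from the JSON above; the TYPES table and the all_attributes helper are hypothetical names for illustration, not part of any Atlas API.

```python
# Attribute names and superTypes copied from the JSON definitions above.
# TYPES and all_attributes are illustrative names, not Atlas APIs.
TYPES = {
    "Referenceable": {
        "superTypes": [],
        "attributes": ["qualifiedName", "replicatedFrom", "replicatedTo"],
    },
    "Asset": {
        "superTypes": ["Referenceable"],
        "attributes": ["name", "description", "owner", "displayName", "userDescription"],
    },
    "DataSet": {"superTypes": ["Asset"], "attributes": []},
    "Process": {"superTypes": ["Asset"], "attributes": ["inputs", "outputs"]},
}


def all_attributes(type_name: str) -> list:
    """Collect attribute names from the supertypes first, then the type itself."""
    type_def = TYPES[type_name]
    names = []
    for parent in type_def["superTypes"]:
        for attr in all_attributes(parent):
            if attr not in names:
                names.append(attr)
    for attr in type_def["attributes"]:
        if attr not in names:
            names.append(attr)
    return names


print(all_attributes("Process"))
```

Under this sketch, a Process ends up with ten attributes even though its own definition declares only inputs and outputs.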

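Atlas lets users define a model for the metadata objects they want to manage, which corresponds to M2 in CWM (Common Warehouse Metamodel). The model is composed of definitions called "types". Instances of types are called "entities", and they are the actual metadata objects being managed. The type system is the component that lets users define and manage types and entities. All metadata objects that Atlas manages out of the box are described by entities under a specific type model. To store new kinds of metadata in Atlas, we need to understand the concepts of the type system component.

Types

A "type" in Atlas is a definition of how a particular kind of metadata object is stored and accessed. A type describes one attribute or a collection of attributes that define what the metadata object contains. Users with a development background can think of a type as a class in an object-oriented programming language, or as a table schema in a relational database.

Below is a type defined in Atlas for Hive tables, named hive_table, with the following attributes: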

Name:         hive_table
TypeCategory: Entity
SuperTypes:   DataSet
Attributes:
    name:             string
    db:               hive_db
    owner:            string
    createTime:       date
    lastAccessTime:   date
    comment:          string
    retention:        int
    sd:               hive_storagedesc
    partitionKeys:    array<hive_column>
    aliases:          array<string>
    columns:          array<hive_column>
    parameters:       map<string,string>
    viewOriginalText: string
    viewExpandedText: string
    tableType:        string
    temporary:        boolean

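Key points from the example above:

  • Every type in Atlas is identified by a unique name.
  • Every type has a metatype. Atlas has the following metatypes:
    1. Primitive metatypes: boolean, byte, short, int, long, float, double, biginteger, bigdecimal, string, date
    2. Enum metatypes
    3. Collection metatypes: array, map
    4. Composite metatypes: Entity, Struct, Classification, Relationship
  • Entity and Classification types can extend other types, and the type being extended is called a supertype. The benefit is that a type inherits the attributes of its supertypes, so modelers can define common attributes in a supertype. In the example, hive_table extends DataSet.
  • Types whose metatype is Entity, Struct, Classification, or Relationship can have a collection of attributes. Each attribute has a name and a corresponding type. Attributes can be referred to with the expression type_name.attribute_name. For example, hive_table.name is a string, hive_table.aliases is an array of strings, and hive_table.db refers to an instance of the hive_db type.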
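
To make the metatype list a bit more concrete, here is a minimal sketch that classifies a typeName string such as array<hive_column> or map<string,string>. It is only string handling written for this post; describe_type and split_top_level are hypothetical helpers, and Atlas itself resolves type names through its own type registry.

```python
# Primitive metatype names as listed above.
PRIMITIVES = {"boolean", "byte", "short", "int", "long", "float", "double",
              "biginteger", "bigdecimal", "string", "date"}


def split_top_level(inner: str):
    """Split 'key,value' at the first comma that is not nested inside <...>."""
    depth = 0
    for i, ch in enumerate(inner):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            return inner[:i], inner[i + 1:]
    raise ValueError("map type needs a key and a value: " + inner)


def describe_type(type_name: str) -> str:
    """Classify a typeName string as primitive, array, map, or a named type."""
    name = type_name.strip()
    if name.startswith("array<") and name.endswith(">"):
        return "array of " + describe_type(name[len("array<"):-1])
    if name.startswith("map<") and name.endswith(">"):
        key, value = split_top_level(name[len("map<"):-1])
        return "map from " + describe_type(key) + " to " + describe_type(value)
    if name in PRIMITIVES:
        return "primitive " + name
    # Anything else is assumed to be an enum or composite type defined elsewhere.
    return "named type " + name


for attr_type in ("string", "array<hive_column>", "map<string,string>", "hive_db"):
    print(attr_type, "->", describe_type(attr_type))
```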

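Entities

An "entity" in Atlas is a specific value or instance of an entity type, and it represents a particular metadata object in the real world. In the object-oriented analogy, an instance is an object of a particular class.
An instance of a Hive table is an entity. Suppose there is a Hive table named "customers" in the "default" database. That table is an entity of the hive_table type.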

guid:     "9ba387dd-fa76-429c-b791-ffc338d3c91f"
typeName: "hive_table"
status:   "ACTIVE"
values:
    name:             "customers"
    db:               { "guid": "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc", "typeName": "hive_db" }
    owner:            "admin"
    createTime:       1490761686029
    updateTime:       1516298102877
    comment:          null
    retention:        0
    sd:               { "guid": "ff58025f-6854-4195-9f75-3a3058dd8dcf", "typeName": "hive_storagedesc" }
    partitionKeys:    null
    aliases:          null
    columns:          [ { "guid": "65e2204f-6a23-4130-934a-9679af6a211f", "typeName": "hive_column" }, { "guid": "d726de70-faca-46fb-9c99-cf04f6b579a6", "typeName": "hive_column" }, ...]
    parameters:       { "transient_lastDdlTime": "1466403208"}
    viewOriginalText: null
    viewExpandedText: null
    tableType:        "MANAGED_TABLE"
    temporary:        false

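Key points from the example above:

  • Every instance of an entity type has a unique identifier, the GUID. The GUID is generated by the Atlas server when the (metadata) object is defined, and it stays constant for the object's entire lifetime. The object can be accessed by this GUID at any time.
  • The values of an entity instance form a map whose keys are the attribute names defined in the type and whose values are the attribute values.
  • Attribute values must conform to the attribute's type as defined in the type. An attribute whose type is an entity type holds a value of type AtlasObjectId.

Both the Entity and Struct metatypes are composed of attributes of other types. However, an instance of an entity type has an identity (a GUID value) and can be referenced by other entities (for example, a hive_table entity references a hive_db entity). A struct type has no identity of its own; the value of a struct type is simply the collection of its attributes.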
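
The difference between an entity (has identity) and a struct (no identity) can be sketched with two small dataclasses. This is only an analogy in plain Python: Entity, Struct, and object_id are names made up for the sketch, the GUID is generated locally with uuid4 as a stand-in for the one the Atlas server would assign, and hive_serde is used purely as an illustrative struct type name.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity carries its own GUID, so other instances can point at it."""
    type_name: str
    values: dict
    # Stand-in for the GUID that the Atlas server would generate.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def object_id(self) -> dict:
        # Same shape as the {"guid": ..., "typeName": ...} references shown above.
        return {"guid": self.guid, "typeName": self.type_name}


@dataclass
class Struct:
    """A struct has no identity; it is just the collection of its values."""
    type_name: str
    values: dict


db = Entity("hive_db", {"name": "default"})
# The table refers to the db by its object id, not by embedding the db itself.
table = Entity("hive_table", {"name": "customers", "db": db.object_id()})
serde = Struct("hive_serde", {"name": "LazySimpleSerDe"})

print(table.values["db"])
```

Two Struct objects with the same values compare equal here, while two Entity objects with the same values still differ by GUID, which is the point of the distinction.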

Attributes

An attribute definition contains the following fields:

"name": "type",
"typeName": "string",
"isOptional": false,
"cardinality": "SINGLE",
"valuesMinCount": 1,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": true,
"includeInNotification": false,
"searchWeight": -1

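name: the name of the attribute.

typeName: the type of the attribute, which can be a primitive type (including date), any defined type, a collection type, and so on.

isOptional: whether the attribute is optional; false means a value must be supplied.

cardinality: one of three values: SINGLE (a single value), LIST (multiple values, duplicates allowed), or SET (multiple values, no duplicates).

valuesMinCount: the minimum number of values for the attribute.

valuesMaxCount: the maximum number of values for the attribute.

isUnique: whether the attribute is unique.

This flag is related to indexing. If an attribute is marked unique, a special index is created for it in JanusGraph that allows equality-based lookups.

Any attribute with this flag set to true is treated like a primary key that distinguishes this entity from other entities. Care should therefore be taken to ensure the attribute models a property that is actually unique in the real world.

For example, consider the name attribute of hive_table. In isolation, a name is not a unique attribute for a hive_table, because tables with the same name can exist in multiple databases. If Atlas stores metadata of Hive tables from multiple clusters, even the pair (database name, table name) is not unique. Only the combination of cluster location, database name, and table name can be considered unique in the physical world.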
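
One way to picture this is to build a single unique string from the three parts. The sketch below assumes a db.table@cluster layout with the database and table names lowercased; that layout is an assumption for illustration, and hive_table_qualified_name is a hypothetical helper, so check what your own hooks actually write before relying on it.

```python
def hive_table_qualified_name(cluster: str, db: str, table: str) -> str:
    """Combine cluster, database, and table into one unique string.

    Assumed layout: <db>.<table>@<cluster>, with db and table lowercased.
    """
    return f"{db.lower()}.{table.lower()}@{cluster}"


# The same table name in two clusters yields two distinct keys, while a
# different spelling of the same table in the same cluster collapses to one.
keys = {
    hive_table_qualified_name("prod", "default", "customers"),
    hive_table_qualified_name("staging", "default", "customers"),
    hive_table_qualified_name("prod", "DEFAULT", "Customers"),
}
print(sorted(keys))
```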

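isIndexable: this flag indicates whether the attribute should be indexed, so that lookups using the attribute value as a predicate can be performed efficiently.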

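constraints: the constraints on the attribute. My guess is that this field can provide something similar to a foreign key in MySQL. There are three default values:

[Image not recovered: list of the three default constraint values]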


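  • name - the name of the attribute.
  • typeName - the name of the attribute's metatype.
  • isIndexable - whether an index is built on this attribute.
  • isUnique - whether the index is unique. Any attribute with this flag set to true can act as a primary key that distinguishes entities.

With isOptional=false, as in the db attribute below, a table entity cannot be created without a reference to a db entity.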

db:
    "name":        "db",
    "typeName":    "hive_db",
    "isOptional":  false,
    "isIndexable": true,
    "isUnique":    false,
    "cardinality": "SINGLE"
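
As a rough illustration of what such a required-attribute check amounts to, the sketch below reports attributes that are declared with isOptional false but have no value. It is not how Atlas implements validation; missing_required and TABLE_ATTRIBUTES are hypothetical names, and the optional columns entry is included only for contrast.

```python
# Attribute definitions in the same shape as the db example above;
# the "columns" entry is a second, optional attribute added for contrast.
TABLE_ATTRIBUTES = [
    {"name": "db", "typeName": "hive_db", "isOptional": False, "cardinality": "SINGLE"},
    {"name": "columns", "typeName": "array<hive_column>", "isOptional": True, "cardinality": "SET"},
]


def missing_required(attribute_defs: list, values: dict) -> list:
    """Return names of attributes with isOptional false that have no value."""
    missing = []
    for attr in attribute_defs:
        if not attr["isOptional"] and values.get(attr["name"]) is None:
            missing.append(attr["name"])
    return missing


# A table without a db reference fails the check; columns may be left out.
print(missing_required(TABLE_ATTRIBUTES, {"name": "customers"}))
```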

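Note the ownedRef constraint in the definition of the columns attribute. It means that column entities must always be bound to the table entity they belong to.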

columns:
    "name":        "columns",
    "typeName":    "array<hive_column>",
    "isOptional":  true,
    "isIndexable": true,
    "isUnique":    false,
    "constraints": [ { "type": "ownedRef" } ]

System-specific types and their significance

Modeling relationships

/**
 * The Relationship category determines the style of relationship around containment and lifecycle.
 * UML terminology is used for the values.
 * <p>
 * ASSOCIATION is a relationship with no containment. <br>
 * COMPOSITION and AGGREGATION are containment relationships.
 * <p>
 * The difference being in the lifecycles of the container and its children. In the COMPOSITION case,
 * the children cannot exist without the container. For AGGREGATION, the life cycles
 * of the container and children are totally independent.
 */
public enum RelationshipCategory {
    ASSOCIATION, AGGREGATION, COMPOSITION
};
 

/**
 * PropagateTags indicates whether tags should propagate across the relationship instance.
 * <p>
 * Tags can propagate:
 * <p>
 * NONE - not at all <br>
 * ONE_TO_TWO - from end 1 to 2 <br>
 * TWO_TO_ONE - from end 2 to 1 <br>
 * BOTH - both ways
 * <p>
 * Care needs to be taken when specifying. The use cases we are aware of where this flag is useful:
 * <p>
 * - propagating confidentiality classifications from a table to columns - ONE_TO_TWO could be used here <br>
 * - propagating classifications around Glossary synonyms - BOTH could be used here.
 * <p>
 * There is an expectation that further enhancements will allow more granular control of tag propagation and will
 * address how to resolve conflicts.
 */
public enum PropagateTags {
    NONE, ONE_TO_TWO, TWO_TO_ONE, BOTH
};

While designing a TypeDef for my metadata types, I ran into a problem.

2021-02-03 17:00:39,471 ERROR - [main:] ~ graph rollback due to exception  (GraphTransactionInterceptor:167)
org.apache.atlas.exception.AtlasBaseException: AGGREGATION relationshipDef attachments creation attempted without an end specifying isContainer
        at org.apache.atlas.type.AtlasRelationshipType.validateAtlasRelationshipDef(AtlasRelationshipType.java:309)

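It turned out that I had defined a relationship with category AGGREGATION but had not specified which end is the container. If neither end is required to contain the other, the relationship category should be changed to ASSOCIATION.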

"relationshipCategory": "AGGREGATION",
"endDef1": {
  "type": "DataSet",
  "name": "files",
  "isContainer": false,
  "cardinality": "SET"
},
"endDef2": {
  "type": "DataSet",
  "name": "files",
  "isContainer": false,
  "cardinality": "SET"
}

Types in Atlas are related to one another. A TypeDef declares these relationships under relationshipDefs.

A relationship has the following properties:

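  1. relationshipCategory: the category of the relationship
     • ASSOCIATION: an association with no container; neither end owns the other
     • AGGREGATION: a containment relationship, one-to-many, in which the container and its children can exist independently of each other
     • COMPOSITION: a containment relationship, one-to-many, in which the children cannot exist without the container
  2. propagateTags: whether, and in which direction, tags propagate
     • NONE: no tag propagation
     • ONE_TO_TWO: propagate from end 1 to end 2
     • TWO_TO_ONE: propagate from end 2 to end 1
     • BOTH: propagate in both directions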
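
The four propagateTags values can be read as a small decision table. The function below is only a sketch of that table for this post; propagates is a hypothetical helper and says nothing about how Atlas applies propagation internally.

```python
def propagates(propagate_tags: str, tagged_end: int) -> bool:
    """Tell whether a tag placed on end 1 or end 2 reaches the opposite end."""
    if tagged_end not in (1, 2):
        raise ValueError("tagged_end must be 1 or 2")
    if propagate_tags == "BOTH":
        return True
    if propagate_tags == "ONE_TO_TWO":
        return tagged_end == 1
    if propagate_tags == "TWO_TO_ONE":
        return tagged_end == 2
    if propagate_tags == "NONE":
        return False
    raise ValueError("unknown propagateTags value: " + propagate_tags)


# A tag on end 1 flows to end 2 under ONE_TO_TWO, but not the other way round.
print(propagates("ONE_TO_TWO", 1), propagates("ONE_TO_TWO", 2))
```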

An endDef has the following properties:

  • cardinality: one of three values, SINGLE, LIST, or SET
  • isContainer: whether this end is the container
"relationshipDefs": [
    {
      "category": "RELATIONSHIP",
      "name": "sample_Table_DB",
      "description": "sample_Table_DB",
      "typeVersion": "1.0",
      "relationshipCategory": "AGGREGATION",
      "propagateTags": "NONE",
      "endDef1": {
        "type": "sample_table_type",
        "name": "db",
        "cardinality": "SINGLE",
        "isContainer": false,
        "isLegacyAttribute": false
      },
      "endDef2": {
        "type": "sample_db_type",
        "name": "tables",
        "cardinality": "SET",
        "isContainer": true,
        "isLegacyAttribute": false
      }
    },
    ...
 }

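Type: my understanding is that a Type is similar to a class in an object-oriented language such as Java.

Entity: an instance. My understanding is that an Entity is similar to an object in an object-oriented language such as Java.

However, reading the Atlas documentation alone is not enough, so let's look at some concrete examples.

The Atlas source code includes an example JSON file that defines a few types:

typedef_create.json

The core part is the EntityDef type. The entityDefs entry holds an array of EntityDef objects, and each one has several parts.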

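  1. category: one of PRIMITIVE, OBJECT_ID_TYPE, ENUM, STRUCT, CLASSIFICATION, ENTITY, ARRAY, MAP, RELATIONSHIP, BUSINESS_METADATA
  2. attributeDefs: the list of attributes. The meaning of each field is covered in the referenced documentation.
     • cardinality - whether the attribute holds a single value or multiple values
     • isIndexable - whether the attribute is indexed
     • isUnique - the value must be unique among instances of the same type in the system; importing an instance that conflicts with an existing one raises an error.
     • isOptional - whether the attribute is required or optional; required attributes are checked on create and import.
     • typeName - the type of the attribute
  3. superTypes: the parent types. This is an array, so multiple inheritance is allowed and the attributes of different parent types are merged.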

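The typedef below defines a metadata type for a database.

  • Attribute locationUri, a string: the URI for accessing the database.
  • Attribute createTime, a long: a timestamp.
  • Attribute randomTable: an array of tables.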
"entityDefs": [
    {
      "category": "ENTITY",
      "name": "sample_db_type",
      "typeVersion": "1.0",
      "attributeDefs": [
        {
          "name": "locationUri",
          "typeName": "string",
          "cardinality": "SINGLE",
          "isOptional": true,
          "isUnique": false,
          "isIndexable": false
        },
        {
          "name": "createTime",
          "typeName": "long",
          "cardinality": "SINGLE",
          "isOptional": true,
          "isUnique": false,
          "isIndexable": false
        },
        {
          "name": "randomTable",
          "typeName": "array<sample_table_type>",
          "cardinality": "SET",
          "isOptional": true,
          "isUnique": false,
          "isIndexable": false
        }
      ],
      "superTypes": [
        "DataSet"
      ]
    },
    ...
  }

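The relationshipDefs entry defines relationships between typeDefs. The example below defines a one-to-many relationship between db and table.

endDef1 and endDef2 each define an attribute, one on the table type and one on the db type, that points to the other side. In endDef1, sample_table_type gets a db attribute that points to exactly one sample_db_type. In endDef2, sample_db_type gets a tables attribute that points to a set of tables.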

"relationshipDefs": [
    {
      "category": "RELATIONSHIP",
      "name": "sample_Table_DB",
      "description": "sample_Table_DB",
      "typeVersion": "1.0",
      "relationshipCategory": "AGGREGATION",
      "propagateTags": "NONE",
      "endDef1": {
        "type": "sample_table_type",
        "name": "db",
        "cardinality": "SINGLE",
        "isContainer": false,
        "isLegacyAttribute": false
      },
      "endDef2": {
        "type": "sample_db_type",
        "name": "tables",
        "cardinality": "SET",
        "isContainer": true,
        "isLegacyAttribute": false
      }
    },
    ...
 }

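So how do we import a typedef JSON into Atlas?

There are two ways:

  1. Put the JSON file directly in the models directory.
  2. Import it through the Atlas REST API; see POST /v2/types/typedefs. I do not recommend the client libraries on PyPI or GitHub: they are not maintained promptly and nobody responds when something breaks. I have fallen into that pit myself.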
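
For the REST route, a plain standard-library request is enough. The sketch below only builds the request and does not send it. The host, port, the /api/atlas prefix, the use of HTTP basic auth, and the admin/admin credentials are all assumptions about a local test setup, and build_typedef_request is a hypothetical helper; adjust everything to your own deployment.

```python
import base64
import json
import urllib.request

# Assumed address of a local Atlas server; the path ends in /v2/types/typedefs.
ATLAS_URL = "http://localhost:21000/api/atlas/v2/types/typedefs"


def build_typedef_request(typedefs: dict, user: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that carries the typedef JSON."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        ATLAS_URL,
        data=json.dumps(typedefs).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,  # assumes basic auth is enabled
        },
        method="POST",
    )


# A trimmed-down payload in the same shape as the entityDefs example above.
payload = {"entityDefs": [{"category": "ENTITY", "name": "sample_db_type",
                           "typeVersion": "1.0", "attributeDefs": [],
                           "superTypes": ["DataSet"]}]}
request = build_typedef_request(payload, "admin", "admin")
print(request.get_method(), request.full_url)
# To actually send it against a running server: urllib.request.urlopen(request)
```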

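endDef1 and endDef2 cannot specify cardinality=LIST, and they cannot both be containers. When one end is a container, the relationship category cannot be RelationshipCategory.ASSOCIATION.

When the category is COMPOSITION or AGGREGATION, one end must be a container.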
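
These rules are easy to check before submitting a typedef. The sketch below encodes just the four rules stated above in plain Python; validate_relationship_def is a hypothetical helper written for this post and is not the validation code Atlas runs. The two sample definitions reuse the failing AGGREGATION fragment and the sample_Table_DB relationship from earlier.

```python
def validate_relationship_def(rel: dict) -> list:
    """Return a list of rule violations; an empty list means the def passes."""
    errors = []
    category = rel["relationshipCategory"]
    ends = [rel["endDef1"], rel["endDef2"]]
    containers = [end for end in ends if end.get("isContainer", False)]

    if any(end.get("cardinality") == "LIST" for end in ends):
        errors.append("an end may not use cardinality LIST")
    if len(containers) == 2:
        errors.append("both ends cannot be containers")
    if category == "ASSOCIATION" and containers:
        errors.append("ASSOCIATION cannot have a container end")
    if category in ("AGGREGATION", "COMPOSITION") and not containers:
        errors.append(category + " needs one end with isContainer=true")
    return errors


# The fragment that triggered the exception: AGGREGATION with no container.
BROKEN = {
    "relationshipCategory": "AGGREGATION",
    "endDef1": {"type": "DataSet", "name": "files", "isContainer": False, "cardinality": "SET"},
    "endDef2": {"type": "DataSet", "name": "files", "isContainer": False, "cardinality": "SET"},
}
# The sample_Table_DB relationship: the db end is the container.
VALID = {
    "relationshipCategory": "AGGREGATION",
    "endDef1": {"type": "sample_table_type", "name": "db", "isContainer": False, "cardinality": "SINGLE"},
    "endDef2": {"type": "sample_db_type", "name": "tables", "isContainer": True, "cardinality": "SET"},
}

print(validate_relationship_def(BROKEN))
print(validate_relationship_def(VALID))
```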

 
