数据库_duckdb_本地访问远程数据_ aws s3

发布时间 2023-04-27 18:21:23作者: 辰令

Extensions 扩展

Full Text Search   HTTPFS  JSON
 Postgres Scanner   SQLite Scanner

python3 -c "import duckdb; duckdb.query('INSTALL httpfs;');" 
Loading an extension from local storage	
 参考 https://duckdb.org/docs/extensions/overview
 https://towardsdatascience.com/a-serverless-query-engine-from-spare-parts-bd6320f10353
  there is a very big difference between a bad idea and a good idea badly executed.
  利用存储和计算的分离,在云中构建一个带有查询引擎的数据湖
    01.数据 uploaded in  aws S3 in a parquet format
    02.  创建带有嵌入式DuckDB的笔记本电脑或小型数据应用程序,以在本地对生产数据进行原型和实验

使用DuckDB查询s3存储桶中的 parquet 文件。

远程parquet文件运行查询之前, 需要安装 httpfs 扩展
DuckDB
    The httpfs extension is a loadable extension implementing a file system that allows reading remote/writing remote files.
	 For pure HTTP(S), only file reading is supported. 
	 For object storage using the S3 API, the httpfs extension supports reading/writing/globbing files.
	 
	With the httpfs extension, it is possible to directly query files over HTTP(S). This currently works for CSV and Parquet files. 

远程安装
 eg:Full-Text Search Indexes 
  INSTALL 'fts';
  LOAD 'fts';

下载后本地安装:Loading an extension from local storage
   https://extensions.duckdb.org/v0.7.0/windows_amd64/json.duckdb_extension.gz	
 unzipping
 the install and load commands can be used with the path to the .duckdb_extension file. 
 install 'httpfs.duckdb_extension';
 load 'httpfs.duckdb_extension';

示例

import pandas as pd
import duckdb

query = """
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region='us-west-2';
    SET s3_access_key_id='key';
    SET s3_secret_access_key='secret';
	#SET s3_session_token='session-token';
    SELECT 
        FROM read_parquet('s3://bucket/folder/file.parquet') 

cursor = duckdb.connect()

cursor.execute(query).df()

SQLite中的扩展

# SQLite FTS5 全文索引功能以扩展模块存在。使用全文索引,可以快速对大段文字进行搜索。SQLite提供FTS3、FTS4、FTS5三个模块
import sqlite3
conn = sqlite.connect('some_db.db')
sqlite.enable_load_extension(True)
sqlite.load_extension('fts5') 

sqlite3的编译选项
a = conn.execute("pragma compile_options;")
for i in a:
    print(i)
如果没有,则编译sqlite3,增加json1扩展 编译pysqlite3