在 Github Pages 或任何静态主机上托管 SQLite 数据库

译者:这实在是太强了

原文标题:Hosting SQLite databases on Github Pages (or any static file hoster)

原文地址:https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/

原文许可:GPLv3

原文发布时间:2021.4.17 翻译时的最终修改时间:2021.5.3

注1:原文可能在本篇翻译发布后进行了更新,本篇译文根据译者当时能获取到的最新版本而翻译。译者水平一般,如有错误还恳请读者指正。

注2:下文中的代码在原博客是可以运行的,由于译者水平不足,代码并不能在本文中运行,故贴出正常运行后的输出。您可以点击原文链接去自行感受。


当我注意到我经常写一个从数据库中请求一些数据并将它们以图表的形式展示的网站作为小工具时,我正在写一个小网页来统计并展示一位Youtube创作者在一段时间内有多少人赞助他的内容。但是如果你想使用数据库,你要么写一个后端 (这意味着你需要一直去托管并维护它),要么将整个数据集下载到浏览器中(当数据集超过10MB的时候,这并不美好)。

以前,当我在为一些小项目使用一个后端服务器的时候,总会碰上一些外部API掉线或者密钥过期,又或者是我自己忘记了我还有后端服务器以至于停止为正在运行它的VPS续费。几年后我再回来看这些小东西时,它已经没了。这让我很是恼火,并责怪我自己去用外部服务,或者责备自己在如此长时间内的关心。

托管静态网页比搞“真”服务器来的简单的多 —— 市面上有许多免费且可靠的选择 (例如GitHub、GitLab Pages, Netlify 等等),并且它可以毫不费力地扩展到几乎无穷大。

所以我写了一个工具,它能在静态托管的网页中使用一个真正的 SQL 数据库!

下面是使用 世界发展指标数据集 的演示——这个数据集包含6张表,超过800万行数据(总计670MiB)。

SQL 示例

select country_code, long_name from wdi_country limit 3;

运行输出

[
  {
    "country_code": "ABW",
    "long_name": "Aruba"
  },
  {
    "country_code": "AFG",
    "long_name": "Islamic State of Afghanistan"
  },
  {
    "country_code": "AGO",
    "long_name": "People's Republic of Angola"
  }
]

Sqlite 统计数据

fetched 1.0KB in 1 requests (DB size: 668.8MB)

你可以看到, 我们可以在查询 wdi_country 表时仅仅获取 1kB 的数据!

这是一个完整的查询引擎。因此我们可以使用下例的 SQLite Json 函数

SQL 示例

select json_extract(arr.value, '$.foo.bar') as bar
  from json_each('[{"foo": {"bar": 123}}, {"foo": {"bar": "baz"}}]') as arr

运行输出

[
  {
    "bar": 123
  },
  {
    "bar": "baz"
  }
]

我们也可以注册 JS 函数以便在查询中调用。下面是一个 getFlag 函数示例,它能够获得国家或地区的旗帜emoji。

JS 示例

function getFlag(country_code) {
  // just some unicode magic
  return String.fromCodePoint(...Array.from(country_code||"")
    .map(c => 127397 + c.codePointAt()));
}

await db.create_function("get_flag", getFlag)
return await db.query(`
  select long_name, get_flag("2-alpha_code") as flag from wdi_country
    where region is not null and currency_unit = 'Euro';
`)

运行输出

[
  {
    "long_name": "Principality of Andorra",
    "flag": "🇦🇩"
  },
  {
    "long_name": "Republic of Austria",
    "flag": "🇦🇹"
  },
  {
    "long_name": "Kingdom of Belgium",
    "flag": "🇧🇪"
  },
  {
    "long_name": "Republic of Cyprus",
    "flag": "🇨🇾"
  },
  {
    "long_name": "Federal Republic of Germany",
    "flag": "🇩🇪"
  },
  {
    "long_name": "Kingdom of Spain",
    "flag": "🇪🇸"
  },
  {
    "long_name": "Republic of Estonia",
    "flag": "🇪🇪"
  },
  {
    "long_name": "Republic of Finland",
    "flag": "🇫🇮"
  },
  {
    "long_name": "French Republic",
    "flag": "🇫🇷"
  },
  {
    "long_name": "Hellenic Republic",
    "flag": "🇬🇷"
  },
  {
    "long_name": "Ireland",
    "flag": "🇮🇪"
  },
  {
    "long_name": "Italian Republic",
    "flag": "🇮🇹"
  },
  {
    "long_name": "Republic of Lithuania",
    "flag": "🇱🇹"
  },
  {
    "long_name": "Grand Duchy of Luxembourg",
    "flag": "🇱🇺"
  },
  {
    "long_name": "Republic of Latvia",
    "flag": "🇱🇻"
  },
  {
    "long_name": "St. Martin (French part)",
    "flag": "🇲🇫"
  },
  {
    "long_name": "Principality of Monaco",
    "flag": "🇲🇨"
  },
  {
    "long_name": "Republic of Malta",
    "flag": "🇲🇹"
  },
  {
    "long_name": "Montenegro",
    "flag": "🇲🇪"
  },
  {
    "long_name": "Kingdom of the Netherlands",
    "flag": "🇳🇱"
  },
  {
    "long_name": "Portuguese Republic",
    "flag": "🇵🇹"
  },
  {
    "long_name": "Republic of San Marino",
    "flag": "🇸🇲"
  },
  {
    "long_name": "Slovak Republic",
    "flag": "🇸🇰"
  },
  {
    "long_name": "Republic of Slovenia",
    "flag": "🇸🇮"
  },
  {
    "long_name": "Republic of Kosovo",
    "flag": "🇽🇰"
  }
]

注意,本站是100%托管在托管在静态文件托管(GitHub Pages)上的。

那么,你要如何才能在静态文件托管上使用数据库呢?首先,(使用C语音编写的)SQLite 是能被编译为WebAssembly的。SQLite可以在没有任何修改的情况下使用emscripten进行编译,并且 sql.js 库是围绕 WebAssembly 代码的瘦JS打包工具。

但是 sql.js 只允许你去创建和读取完全存储在内存中的数据库——所以我实现了一个虚拟文件系统sql.js-httpvfs,它能够在 SQLite 尝试从文件系统中读取时使用HTTP范围请求来获取数据库的chunk。从SQLite的角度来看,这看起来是它自己在一台普通电脑上运行,除了里面有个可以读取的名叫 /wdi.sqlite3 的文件外,其他的文件系统就是空的。当然它并不能写入这个文件,但是一个只读的数据库依然十分有用。

由于通过HTTP来获取所有数据有相当大的开销,所以我们要以块的方式来获取数据,并期以在请求数与使用的带宽之间寻找一定的平衡。值得庆幸的是,SQLite 已经将其数据库组织进用户定义大小(默认为4KiB)的“页”中。我已经将此数据库的页大小设置为1KiB。

下面是一个简单的索引查询示例:

SQL 示例

select indicator_code, long_definition from wdi_series where indicator_name
    = 'Literacy rate, youth total (% of people ages 15-24)'

运行输出

[
  {
    "indicator_code": "SE.ADT.1524.LT.ZS",
    "long_definition": "Youth literacy rate is the percentage of people ages 15-24 who can both read and write with understanding a short simple statement about their everyday life."
  }
]

SQLite 页读取请求

PageCacheAccess patternTable/IndexPage Type
0hit[system]
17hitsqlite_schemaleaf
57458missrandom (1 KiB XHR)index on wdi_series (indicator_name)leaf
57464missrandom (1 KiB XHR)index on wdi_series (indicator_name)leaf
6hitwdi_footnoteinternal
55684missrandom (1 KiB XHR)wdi_seriesleaf
55774missrandom (1 KiB XHR)wdi_seriesoverflow

运行上述查询并检查日志。SQLite 为该查询执行了7页的读取。

  • 三页的读取只是为了获取架构信息,这些已被缓存。
  • 两页的读取是查询 on wdi_series (indicator_name) 中的索引。
  • 两页的读取是在 wdi_series 表上,第一张是为了通过主键查找行值,第二张是在溢出页中获取文本数据。

索引和表的读取都是 B 树查找。

来一个更复杂的问题:根据2010年以后的最新数据,青年识字率最低的国家是哪个?

SQL 示例

with newest_datapoints as (
  select country_code, indicator_code, max(year) as year from wdi_data
  join wdi_series using (indicator_code)
  where
    indicator_name = 'Literacy rate, youth total (% of people ages 15-24)'
    and year > 2010
  group by country_code
)
select c.short_name as country, printf('%.1f %%', value) as "Youth Literacy Rate"
from wdi_data
  join wdi_country c using (country_code)
  join newest_datapoints using (indicator_code, country_code, year)
order by value asc limit 10

运行输出

[
  {
    "country": "Chad",
    "Youth Literacy Rate": "30.8 %"
  },
  {
    "country": "Central African Republic",
    "Youth Literacy Rate": "38.3 %"
  },
  {
    "country": "Niger",
    "Youth Literacy Rate": "43.5 %"
  },
  {
    "country": "Guinea",
    "Youth Literacy Rate": "46.3 %"
  },
  {
    "country": "South Sudan",
    "Youth Literacy Rate": "47.9 %"
  },
  {
    "country": "Mali",
    "Youth Literacy Rate": "50.1 %"
  },
  {
    "country": "Liberia",
    "Youth Literacy Rate": "55.4 %"
  },
  {
    "country": "Burkina Faso",
    "Youth Literacy Rate": "58.3 %"
  },
  {
    "country": "Côte d'Ivoire",
    "Youth Literacy Rate": "58.4 %"
  },
  {
    "country": "Guinea-Bissau",
    "Youth Literacy Rate": "60.4 %"
  }
]

SQLite 页读取请求 (太大了,不放)

上面的查询应该会做10-20个GET请求,总共获取 130 ~ 270KiB 数据。请注意,它只需要执行 20 个请求,而不是 270 个请求(对于一个270KiB的文件一次获取1KiB时预期的数量)。这是因为我实现了一个预取系统,该系统试图通过三个单独的虚拟读取头来检测访问模式,并以指数方式增加顺序读取的请求大小。这意味着读取超过数 KiB 数据的索引扫描或表扫描只会导致在扫描的总字节长度上成对数的请求数。您可以通过查看上面页面读取日志中的“Access pattern”列来查看其效果。

只有当数据库中的索引与查询匹配良好时,所有这些方法才能很好地工作。例如,上述查询中使用的索引是 INDEX ON wdi_data (indicator_code, country_code, year, value) 。如果该索引不包含值列,则 SQLite 引擎必须执行另一次随机访问(不可预测)读取,从而使 HTTP 请求检索每个数据点的实际值。如果索引是以 country_code, indicator_code, ... 排序的,那么我们将能够快速获取单个国家或地区的所有指标,但不是单个指标的所有国家或地区值。

我们还可以利用SQLite FTS模块,这样我们就可以对数据库中文本较多的信息进行全文搜索 —— 在下面的例子中,数据库中有超过1000个人类发展指标,并且具有更长的描述。

SQL 示例

select * from indicator_search
where indicator_search match 'educatio* femal*'
order by rank limit 10

运行输出

[
  {
    "indicator_code": "SL.UEM.BASC.FE.ZS",
    "topic": "Social Protection & Labor: Unemployment",
    "indicator_name": "Unemployment with basic education, female (% of female labor force with basic education)",
    "short_definition": null,
    "long_definition": "The percentage of the labor force with a basic level of education who are unemployed. Basic education comprises primary education or lower secondary education according to the International Standard Classification of Education 2011 (ISCED 2011).",
    "statistical_concept_and_methodology": null,
    "development_relevance": null
  },
  {
    "indicator_code": "SL.UEM.INTM.FE.ZS",
    "topic": "Social Protection & Labor: Unemployment",
    "indicator_name": "Unemployment with intermediate education, female (% of female labor force with intermediate education)",
    "short_definition": null,
    "long_definition": "The percentage of the labor force with an intermediate level of education who are unemployed. Intermediate education comprises upper secondary or post-secondary non tertiary education according to the International Standard Classification of Education 2011 (ISCED 2011).",
    "statistical_concept_and_methodology": null,
    "development_relevance": null
  },
  {
    "indicator_code": "SL.TLF.BASC.FE.ZS",
    "topic": "Social Protection & Labor: Labor force structure",
    "indicator_name": "Labor force with basic education, female (% of female working-age population with basic education)",
    "short_definition": null,
    "long_definition": "The percentage of the working age population with a basic level of education who are in the labor force. Basic education comprises primary education or lower secondary education according to the International Standard Classification of Education 2011 (ISCED 2011).",
    "statistical_concept_and_methodology": null,
    "development_relevance": null
  },
  {
    "indicator_code": "SL.TLF.INTM.FE.ZS",
    "topic": "Social Protection & Labor: Labor force structure",
    "indicator_name": "Labor force with intermediate education, female (% of female working-age population with intermediate education)",
    "short_definition": null,
    "long_definition": "The percentage of the working age population with an intermediate level of education who are in the labor force. Intermediate education comprises upper secondary or post-secondary non tertiary education according to the International Standard Classification of Education 2011 (ISCED 2011).",
    "statistical_concept_and_methodology": null,
    "development_relevance": null
  },
  {
    "indicator_code": "SL.UEM.ADVN.FE.ZS",
    "topic": "Social Protection & Labor: Unemployment",
    "indicator_name": "Unemployment with advanced education, female (% of female labor force with advanced education)",
    "short_definition": null,
    "long_definition": "The percentage of the labor force with an advanced level of education who are unemployed. Advanced education comprises short-cycle tertiary education, a bachelor’s degree or equivalent education level, a master’s degree or equivalent education level, or doctoral degree or equivalent education level according to the International Standard Classification of Education 2011 (ISCED 2011).",
    "statistical_concept_and_methodology": null,
    "development_relevance": null
  },
  {
    "indicator_code": "SL.TLF.ADVN.FE.ZS",
    "topic": "Social Protection & Labor: Labor force structure",
    "indicator_name": "Labor force with advanced education, female (% of female working-age population with advanced education)",
    "short_definition": null,
    "long_definition": "The percentage of the working age population with an advanced level of education who are in the labor force. Advanced education comprises short-cycle tertiary education, a bachelor’s degree or equivalent education level, a master’s degree or equivalent education level, or doctoral degree or equivalent education level according to the International Standard Classification of Education 2011 (ISCED 2011).",
    "statistical_concept_and_methodology": null,
    "development_relevance": null
  },
  {
    "indicator_code": "SE.PRE.TCAQ.FE.ZS",
    "topic": "Education: Inputs",
    "indicator_name": "Trained teachers in preprimary education, female (% of female teachers)",
    "short_definition": null,
    "long_definition": "Trained teachers in preprimary education are the percentage of preprimary school teachers who have received the minimum organized teacher training (pre-service or in-service) required for teaching in a given country.",
    "statistical_concept_and_methodology": "Share of trained teachers is calculated by dividing the number of trained teachers of the specified level of education by total number of teachers at the same level of education, and multiplying by 100. \n\nData on education are collected by the UNESCO Institute for Statistics from official responses to its annual education survey. All the data are mapped to the International Standard Classification of Education (ISCED) to ensure the comparability of education programs at the international level. The current version was formally adopted by UNESCO Member States in 2011. \n\nThe reference years reflect the school year for which the data are presented. In some countries the school year spans two calendar years (for example, from September 2010 to June 2011); in these cases the reference year refers to the year in which the school year ended (2011 in the example).",
    "development_relevance": "Trained teachers refer to teaching force with the necessary pedagogical skills to teach and use teaching materials in an effective manner. The share of trained teachers reveals a country's commitment to investing in the development of its human capital engaged in teaching.\n\nTeachers are important resource, especially for children who are the first-generation of receiving education in their families and heavily rely on teachers in acquiring basic literacy skills. However, rapid increase in enrollments may cause the shortage of trained teachers. Education finance is a key for appropriate teacher allocations, since teacher salaries account for a large share of education budgets. The shortage of trained teacher may result in low qualified teachers in more disadvantaged area."
  },
  {
    "indicator_code": "SE.PRM.TCAQ.FE.ZS",
    "topic": "Education: Inputs",
    "indicator_name": "Trained teachers in primary education, female (% of female teachers)",
    "short_definition": null,
    "long_definition": "Trained teachers in primary education are the percentage of primary school teachers who have received the minimum organized teacher training (pre-service or in-service) required for teaching in a given country.",
    "statistical_concept_and_methodology": "Share of trained teachers is calculated by dividing the number of trained teachers of the specified level of education by total number of teachers at the same level of education, and multiplying by 100. \n\nData on education are collected by the UNESCO Institute for Statistics from official responses to its annual education survey. All the data are mapped to the International Standard Classification of Education (ISCED) to ensure the comparability of education programs at the international level. The current version was formally adopted by UNESCO Member States in 2011. \n\nThe reference years reflect the school year for which the data are presented. In some countries the school year spans two calendar years (for example, from September 2010 to June 2011); in these cases the reference year refers to the year in which the school year ended (2011 in the example).",
    "development_relevance": "Trained teachers refer to teaching force with the necessary pedagogical skills to teach and use teaching materials in an effective manner. The share of trained teachers reveals a country's commitment to investing in the development of its human capital engaged in teaching.\n\nTeachers are important resource, especially for children who are the first-generation of receiving education in their families and heavily rely on teachers in acquiring basic literacy skills. However, rapid increase in enrollments may cause the shortage of trained teachers. Education finance is a key for appropriate teacher allocations, since teacher salaries account for a large share of education budgets. The shortage of trained teacher may result in low qualified teachers in more disadvantaged area."
  },
  {
    "indicator_code": "SE.SEC.TCAQ.FE.ZS",
    "topic": "Education: Inputs",
    "indicator_name": "Trained teachers in secondary education, female (% of female teachers)",
    "short_definition": null,
    "long_definition": "Trained teachers in secondary education are the percentage of secondary school teachers who have received the minimum organized teacher training (pre-service or in-service) required for teaching in a given country.",
    "statistical_concept_and_methodology": "Share of trained teachers is calculated by dividing the number of trained teachers of the specified level of education by total number of teachers at the same level of education, and multiplying by 100. \n\nData on education are collected by the UNESCO Institute for Statistics from official responses to its annual education survey. All the data are mapped to the International Standard Classification of Education (ISCED) to ensure the comparability of education programs at the international level. The current version was formally adopted by UNESCO Member States in 2011. \n\nThe reference years reflect the school year for which the data are presented. In some countries the school year spans two calendar years (for example, from September 2010 to June 2011); in these cases the reference year refers to the year in which the school year ended (2011 in the example).",
    "development_relevance": "Trained teachers refer to teaching force with the necessary pedagogical skills to teach and use teaching materials in an effective manner. The share of trained teachers reveals a country's commitment to investing in the development of its human capital engaged in teaching.\n\nTeachers are important resource, especially for children who are the first-generation of receiving education in their families and heavily rely on teachers in acquiring basic literacy skills. However, rapid increase in enrollments may cause the shortage of trained teachers. Education finance is a key for appropriate teacher allocations, since teacher salaries account for a large share of education budgets. The shortage of trained teacher may result in low qualified teachers in more disadvantaged area."
  },
  {
    "indicator_code": "SE.SEC.TCAQ.LO.FE.ZS",
    "topic": "Education: Inputs",
    "indicator_name": "Trained teachers in lower secondary education, female (% of female teachers)",
    "short_definition": null,
    "long_definition": "Trained teachers in lower secondary education are the percentage of lower secondary school teachers who have received the minimum organized teacher training (pre-service or in-service) required for teaching in a given country.",
    "statistical_concept_and_methodology": "Share of trained teachers is calculated by dividing the number of trained teachers of the specified level of education by total number of teachers at the same level of education, and multiplying by 100. \n\nData on education are collected by the UNESCO Institute for Statistics from official responses to its annual education survey. All the data are mapped to the International Standard Classification of Education (ISCED) to ensure the comparability of education programs at the international level. The current version was formally adopted by UNESCO Member States in 2011. \n\nThe reference years reflect the school year for which the data are presented. In some countries the school year spans two calendar years (for example, from September 2010 to June 2011); in these cases the reference year refers to the year in which the school year ended (2011 in the example).",
    "development_relevance": "Trained teachers refer to teaching force with the necessary pedagogical skills to teach and use teaching materials in an effective manner. The share of trained teachers reveals a country's commitment to investing in the development of its human capital engaged in teaching.\n\nTeachers are important resource, especially for children who are the first-generation of receiving education in their families and heavily rely on teachers in acquiring basic literacy skills. However, rapid increase in enrollments may cause the shortage of trained teachers. Education finance is a key for appropriate teacher allocations, since teacher salaries account for a large share of education budgets. The shortage of trained teacher may result in low qualified teachers in more disadvantaged area."
  }
]

indicator_search FTS 表中的数据总量约为 8 MB。上面的查询应该只获取了大约70 KiB的数据。您可以在此处查看其构造方式。

最后,这里更完整地演示了这个系统的实用性 —— 这里有一个交互式图表,显示了一些国家或地区随时间的发展情况,适用于在数据库中的任何指标的任何国家/地区,只要你想:

示例图片
示例图片

注意,许多指标只适用于某些国家,例如,指标“认为丈夫有理由在妻子焚烧食物时殴打妻子的妇女”仅基于在较欠发达国家或地区进行的调查。

扩展: DOM 数据库

既然我们已经在浏览器中运行了一个数据库,为什么不使用一个虚拟表dom来使我们的浏览器作为数据库?

SQL 示例

select count(*) as number_of_demos from dom
  where selector match '.content div.sqlite-httpvfs-demo';
select count(*) as sqlite_mentions from dom
  where selector match '.content p' and textContent like '%SQLite%';

运行输出

[
  {
    "number_of_demos": 10
  },
  {
    "sqlite_mentions": 8
  }
]

我们甚至可以将元素直接插入到 DOM 中:

SQL 示例

insert into dom (parent, tagName, textContent)
    select 'ul#outtable1', 'li', short_name
    from wdi_country where currency_unit = 'Euro'

运行之前
运行之前
运行之后
运行之后

并且更新 DOM 中的元素:

SQL 示例

update dom set textContent =
  get_flag("2-alpha_code") || ' ' || textContent
from wdi_country
where selector match 'ul#outtable1 > li'
  and textContent = wdi_country.short_name

运行之前
运行之前
运行之后
运行之后


当然,文章里的一切都是开源的。sqlite 打包工具的主要实现是 sql.js-httpvfs。这篇博客文章的源代码是一个 pandoc markdown文件,演示是一个 自定义的“受限代码块”React组件