Semantic type classification — detects 244 data types (emails, URLs, dates, UUIDs, currencies, etc.) from raw strings
Installing and Loading
INSTALL finetype FROM community;
LOAD finetype;
Example
-- Profile every column of a table: one row per column, with the detected
-- semantic type, model confidence, and recommended DuckDB storage type.
D CREATE TABLE people AS SELECT * FROM (VALUES
('jane.doe@company.co.uk', '+44 20 7946 0958', {'city': 'London'}),
('john.smith@example.org', '0117 496 0123', {'city': 'Bristol'}),
('not-an-email', 'not-a-phone', {'city': 'Leeds'})
) AS t(email, phone, addr);
D SELECT * FROM ft_profile('people');
┌─────────────┬──────────────────────────────────────────────────────────────────────────────────┬────────────────────┬─────────────┐
│ column_name │ type │ confidence │ duckdb_type │
│ varchar │ varchar │ double │ varchar │
├─────────────┼──────────────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┤
│ addr │ nested STRUCT(city VARCHAR) column — unnest / to_json / extract before profiling │ NULL │ NULL │
│ email │ identity.person.email │ 0.5812865495681763 │ VARCHAR │
│ phone │ identity.person.phone_number │ 0.5 │ VARCHAR │
└─────────────┴──────────────────────────────────────────────────────────────────────────────────┴────────────────────┴─────────────┘
-- (addr is a nested STRUCT column: instead of silently mis-classifying it,
-- ft_profile emits a flatten hint in the type column and leaves
-- confidence / duckdb_type NULL. See the nested-column note below.)
-- Validate a table against a JSON Schema (note the top-level "properties"
-- key): one row per column, counting how many values the schema rejects.
D SELECT * FROM ft_validate('people',
'{"properties":{"email":{"type":"string","pattern":"^[^@]+@[^@]+\.[^@]+$"},
"phone":{"type":"string","pattern":"^\+?[0-9().+ -]{7,}$"}}}');
┌─────────────┬───────┬─────────┬───────────────────────────────────────────────────────────────────────────────────┐
│ column_name │ total │ rejects │ sample_message │
│ varchar │ int64 │ int64 │ varchar │
├─────────────┼───────┼─────────┼───────────────────────────────────────────────────────────────────────────────────┤
│ addr │ NULL │ NULL │ nested STRUCT(city VARCHAR) column — unnest / to_json / extract before validating │
│ email │ 3 │ 1 │ "not-an-email" does not match "^[^@]+@[^@]+\.[^@]+$" │
│ phone │ 3 │ 1 │ "not-a-phone" does not match "^\+?[0-9().+ -]{7,}$" │
└─────────────┴───────┴─────────┴───────────────────────────────────────────────────────────────────────────────────┘
-- Classify a single value (per-value, no column context)
D SELECT ft_infer('https://example.com') AS detected_type;
┌──────────────────────────┐
│ detected_type │
│ varchar │
├──────────────────────────┤
│ technology.internet.url │
└──────────────────────────┘
About finetype
FineType is a semantic type classifier that detects 244 data types from raw string values. A
multi-branch neural network labels each value, mapping it into a three-level taxonomy:
domain.category.type.
FineType is column-oriented: its accuracy comes from seeing a whole column at once, so the
primary surface is the ft_profile / ft_validate table verbs below. Per-value scalar
functions are also provided for ad-hoc classification, but a lone value that normally leans on its
column neighbours to disambiguate (a bare UUID, a phone number) is classified less confidently
than the same value seen in column context.
Table verbs
These take a table (or a LIST of column values) and return one row per column — the natural
grain for profiling and validation.
ft_profile(table_name VARCHAR)
Profile every column of a table. Returns (column_name, type, confidence, duckdb_type),
one row per column. The table is named as a string literal; FineType reaches into the catalog to
read its columns.
SELECT * FROM ft_profile('my_table');
Nested STRUCT / LIST columns are guarded, not silently rejected: rather than forcing them
through per-value classification that would misreport them, ft_profile returns a flatten hint in
the type column (with confidence and duckdb_type left NULL). Flatten the column first if you
need it typed.
ft_validate(table_name VARCHAR, schema_json VARCHAR)
Validate a table's columns against a JSON Schema. Returns one row per validated column with the
total values checked, the reject count, and a sample rejection message. The schema can be supplied
inline, via getvariable, or read from a file with DuckDB's read_text:
The JSON Schema is keyed by column under a top-level properties object.
-- inline
SELECT * FROM ft_validate('my_table', '{"properties":{"email":{"type":"string","format":"email"}}}');
-- from a file (a bare path is read for you; an inline string starting with "{" is used as-is)
SELECT * FROM ft_validate('my_table', 'schema.json');
ft_profile(values LIST)
Profile a single column passed as a list — useful when the values are already in hand rather than in a named table.
SELECT * FROM ft_profile((SELECT list(email) FROM people));
Scalar functions
Per-value classification, one value at a time. Strongest on values unambiguous in isolation (URLs, ISO dates, well-formed emails, IP addresses).
ft_infer(value VARCHAR) → VARCHAR
Classify a single value. Returns the full semantic type label.
SELECT ft_infer('https://example.com'); -- technology.internet.url
SELECT ft_infer('2024-01-15'); -- datetime.date.iso
SELECT ft_infer('true'); -- representation.boolean.terms
ft_detail(value VARCHAR) → VARCHAR
Classify with full detail. Returns JSON with the type, confidence (0.0–1.0), the recommended DuckDB type, the number of samples seen, the disambiguation path, and the per-label vote map.
SELECT ft_detail('192.168.1.1');
-- {"type": "technology.internet.ip_v4", "confidence": 0.901, "duckdb_type": "INET", ...}
ft_cast(value VARCHAR) → VARCHAR
Normalize a value for safe TRY_CAST() to its detected DuckDB type. Handles date format
conversion (US/EU → ISO), boolean normalization, UUID lowercasing, numeric cleanup.
SELECT ft_cast('01/15/2024'); -- 2024-01-15 (US date → ISO)
ft_validate_text(value VARCHAR, schema_json VARCHAR) → STRUCT(valid BOOLEAN, "constraint" VARCHAR, message VARCHAR)
Validate a single value against a JSON Schema fragment. Returns a struct: valid is true when
the value conforms (with constraint and message NULL); on failure valid is false and
constraint / message name the failing keyword and the reason. This is the per-cell engine the
ft_validate table verb runs over every column.
SELECT ft_validate_text('not-an-email', '{"type":"string","pattern":"^[^@]+@[^@]+\.[^@]+$"}');
-- {'valid': false, 'constraint': pattern, 'message': '"not-an-email" does not match "^[^@]+@[^@]+\.[^@]+$"'}
ft_unpack(json VARCHAR) → VARCHAR
Recursively classify every scalar value in a JSON document. Returns annotated JSON carrying the type, confidence, recommended DuckDB type, and original value for each field.
ft_version() → VARCHAR
Returns the extension version string.
Aliases. The earlier un-prefixed scalar names (
finetype,finetype_detail,finetype_cast,finetype_unpack,finetype_validate,finetype_version) remain registered as aliases of theft_scalars, so existing code keeps working. New code should use theft_names.
Type Taxonomy
244 types organized into 7 domains:
- container: JSON, XML, CSV, arrays, key-value (11 types)
- datetime: dates, times, timestamps, epochs, durations, offsets (86 types)
- finance: currencies, accounting, market identifiers, transactions (28 types)
- geography: coordinates, locations, addresses, transportation (25 types)
- identity: names, emails, phones, payments, medical (33 types)
- representation: booleans, numbers, text, files, scientific (32 types)
- technology: URLs, IPs, UUIDs, versions, codes (29 types)
For more information, see the FineType documentation.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| finetype | scalar | NULL | NULL | |
| finetype_cast | scalar | NULL | NULL | |
| finetype_detail | scalar | NULL | NULL | |
| finetype_spike | table | NULL | NULL | |
| finetype_unpack | scalar | NULL | NULL | |
| finetype_validate | scalar | NULL | NULL | |
| finetype_version | scalar | NULL | NULL | |
| ft_cast | scalar | NULL | NULL | |
| ft_detail | scalar | NULL | NULL | |
| ft_infer | scalar | NULL | NULL | |
| ft_profile | scalar | NULL | NULL | |
| ft_profile | table_macro | NULL | NULL | |
| ft_unpack | scalar | NULL | NULL | |
| ft_validate | table_macro | NULL | NULL | |
| ft_validate_text | scalar | NULL | NULL | |
| ft_version | scalar | NULL | NULL |
Overloaded Functions
This extension does not add any function overloads.
Added Types
This extension does not add any types.
Added Settings
This extension does not add any settings.