Search Shortcut cmd + k | ctrl + k
finetype

Semantic type classification — detects 244 data types (emails, URLs, dates, UUIDs, currencies, etc.) from raw strings

Maintainer(s): hughcameron

Installing and Loading

INSTALL finetype FROM community;
LOAD finetype;

Example

-- Profile every column of a table: one row per column, with the detected
-- semantic type, model confidence, and recommended DuckDB storage type.
D CREATE TABLE people AS SELECT * FROM (VALUES
    ('jane.doe@company.co.uk', '+44 20 7946 0958', {'city': 'London'}),
    ('john.smith@example.org', '0117 496 0123',    {'city': 'Bristol'}),
    ('not-an-email',           'not-a-phone',       {'city': 'Leeds'})
  ) AS t(email, phone, addr);

D SELECT * FROM ft_profile('people');
┌─────────────┬──────────────────────────────────────────────────────────────────────────────────┬────────────────────┬─────────────┐
 column_name                                        type                                            confidence      duckdb_type 
   varchar                                        varchar                                             double          varchar   
├─────────────┼──────────────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┤
 addr         nested STRUCT(city VARCHAR) column  unnest / to_json / extract before profiling                NULL  NULL        
 email        identity.person.email                                                             0.5812865495681763  VARCHAR     
 phone        identity.person.phone_number                                                                     0.5  VARCHAR     
└─────────────┴──────────────────────────────────────────────────────────────────────────────────┴────────────────────┴─────────────┘
-- (addr is a nested STRUCT column: instead of silently mis-classifying it,
--  ft_profile emits a flatten hint in the type column and leaves
--  confidence / duckdb_type NULL. See the nested-column note below.)

-- Validate a table against a JSON Schema (note the top-level "properties"
-- key): one row per column, counting how many values the schema rejects.
D SELECT * FROM ft_validate('people',
    '{"properties":{"email":{"type":"string","pattern":"^[^@]+@[^@]+\.[^@]+$"},
                    "phone":{"type":"string","pattern":"^\+?[0-9().+ -]{7,}$"}}}');
┌─────────────┬───────┬─────────┬───────────────────────────────────────────────────────────────────────────────────┐
 column_name  total  rejects                                   sample_message                                   
   varchar    int64   int64                                        varchar                                      
├─────────────┼───────┼─────────┼───────────────────────────────────────────────────────────────────────────────────┤
 addr          NULL     NULL  nested STRUCT(city VARCHAR) column  unnest / to_json / extract before validating 
 email            3        1  "not-an-email" does not match "^[^@]+@[^@]+\.[^@]+$"                              
 phone            3        1  "not-a-phone" does not match "^\+?[0-9().+ -]{7,}$"                               
└─────────────┴───────┴─────────┴───────────────────────────────────────────────────────────────────────────────────┘

-- Classify a single value (per-value, no column context)
D SELECT ft_infer('https://example.com') AS detected_type;
┌──────────────────────────┐
      detected_type       
         varchar          
├──────────────────────────┤
 technology.internet.url  
└──────────────────────────┘

About finetype

FineType is a semantic type classifier that detects 244 data types from raw string values. A multi-branch neural network labels each value, mapping it into a three-level taxonomy: domain.category.type.

FineType is column-oriented: its accuracy comes from seeing a whole column at once, so the primary surface is the ft_profile / ft_validate table verbs below. Per-value scalar functions are also provided for ad-hoc classification, but a lone value that normally leans on its column neighbours to disambiguate (a bare UUID, a phone number) is classified less confidently than the same value seen in column context.

Table verbs

These take a table (or a LIST of column values) and return one row per column — the natural grain for profiling and validation.

ft_profile(table_name VARCHAR)

Profile every column of a table. Returns (column_name, type, confidence, duckdb_type), one row per column. The table is named as a string literal; FineType reaches into the catalog to read its columns.

SELECT * FROM ft_profile('my_table');

Nested STRUCT / LIST columns are guarded, not silently rejected: rather than forcing them through per-value classification that would misreport them, ft_profile returns a flatten hint in the type column (with confidence and duckdb_type left NULL). Flatten the column first if you need it typed.

ft_validate(table_name VARCHAR, schema_json VARCHAR)

Validate a table's columns against a JSON Schema. Returns one row per validated column with the total values checked, the reject count, and a sample rejection message. The schema can be supplied inline, via getvariable, or read from a file with DuckDB's read_text:

The JSON Schema is keyed by column under a top-level properties object.

-- inline
SELECT * FROM ft_validate('my_table', '{"properties":{"email":{"type":"string","format":"email"}}}');

-- from a file (a bare path is read for you; an inline string starting with "{" is used as-is)
SELECT * FROM ft_validate('my_table', 'schema.json');

ft_profile(values LIST)

Profile a single column passed as a list — useful when the values are already in hand rather than in a named table.

SELECT * FROM ft_profile((SELECT list(email) FROM people));

Scalar functions

Per-value classification, one value at a time. Strongest on values unambiguous in isolation (URLs, ISO dates, well-formed emails, IP addresses).

ft_infer(value VARCHAR) → VARCHAR

Classify a single value. Returns the full semantic type label.

SELECT ft_infer('https://example.com');  -- technology.internet.url
SELECT ft_infer('2024-01-15');           -- datetime.date.iso
SELECT ft_infer('true');                 -- representation.boolean.terms

ft_detail(value VARCHAR) → VARCHAR

Classify with full detail. Returns JSON with the type, confidence (0.0–1.0), the recommended DuckDB type, the number of samples seen, the disambiguation path, and the per-label vote map.

SELECT ft_detail('192.168.1.1');
-- {"type": "technology.internet.ip_v4", "confidence": 0.901, "duckdb_type": "INET", ...}

ft_cast(value VARCHAR) → VARCHAR

Normalize a value for safe TRY_CAST() to its detected DuckDB type. Handles date format conversion (US/EU → ISO), boolean normalization, UUID lowercasing, numeric cleanup.

SELECT ft_cast('01/15/2024');  -- 2024-01-15 (US date → ISO)

ft_validate_text(value VARCHAR, schema_json VARCHAR) → STRUCT(valid BOOLEAN, "constraint" VARCHAR, message VARCHAR)

Validate a single value against a JSON Schema fragment. Returns a struct: valid is true when the value conforms (with constraint and message NULL); on failure valid is false and constraint / message name the failing keyword and the reason. This is the per-cell engine the ft_validate table verb runs over every column.

SELECT ft_validate_text('not-an-email', '{"type":"string","pattern":"^[^@]+@[^@]+\.[^@]+$"}');
-- {'valid': false, 'constraint': pattern, 'message': '"not-an-email" does not match "^[^@]+@[^@]+\.[^@]+$"'}

ft_unpack(json VARCHAR) → VARCHAR

Recursively classify every scalar value in a JSON document. Returns annotated JSON carrying the type, confidence, recommended DuckDB type, and original value for each field.

ft_version() → VARCHAR

Returns the extension version string.

Aliases. The earlier un-prefixed scalar names (finetype, finetype_detail, finetype_cast, finetype_unpack, finetype_validate, finetype_version) remain registered as aliases of the ft_ scalars, so existing code keeps working. New code should use the ft_ names.

Type Taxonomy

244 types organized into 7 domains:

  • container: JSON, XML, CSV, arrays, key-value (11 types)
  • datetime: dates, times, timestamps, epochs, durations, offsets (86 types)
  • finance: currencies, accounting, market identifiers, transactions (28 types)
  • geography: coordinates, locations, addresses, transportation (25 types)
  • identity: names, emails, phones, payments, medical (33 types)
  • representation: booleans, numbers, text, files, scientific (32 types)
  • technology: URLs, IPs, UUIDs, versions, codes (29 types)

For more information, see the FineType documentation.

Added Functions

function_name function_type description comment examples
finetype scalar NULL NULL  
finetype_cast scalar NULL NULL  
finetype_detail scalar NULL NULL  
finetype_spike table NULL NULL  
finetype_unpack scalar NULL NULL  
finetype_validate scalar NULL NULL  
finetype_version scalar NULL NULL  
ft_cast scalar NULL NULL  
ft_detail scalar NULL NULL  
ft_infer scalar NULL NULL  
ft_profile scalar NULL NULL  
ft_profile table_macro NULL NULL  
ft_unpack scalar NULL NULL  
ft_validate table_macro NULL NULL  
ft_validate_text scalar NULL NULL  
ft_version scalar NULL NULL  

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.