BLACKLYNX SQL GETTING STARTED GUIDE

Using BlackLynx SQL Extensions

Version 1.3

ABSTRACT

Use the BlackLynx ODBC/JDBC Connector to interface your data to your business analytics, business intelligence, or business visualization applications without the need for indexing or ETL.

June 21, 2019

BlackLynx ODBC/JDBC SQL Extensions

BlackLynx SQL Getting Started Guide BlackLynx, Inc. ©2018 Page | 1

Revisions:

| Date | Reason for Change | ODBC Version |
| --- | --- | --- |
| April 2019 | Update for compatibility with BlackLynx RESTful server revision 1.3.0. Add sections on PCAP and PIP primitives. Add "any" field search extension. | 2.7.0.2 |
| March 2019 | Added CentOS 7.6 support. Added field_delimiter parameter for CSV type files. | 2.6.45.4 |
| December 2018 | Added PCAP support. | 2.6.45.1 |

Use the BlackLynx ODBC/JDBC Connector to interface your data to your business analytics, business intelligence, or business visualization applications without the need for indexing or ETL (extract, transform and load). The connector interfaces with data in the following formats:

• PCAP (packet capture binary)

• XML

• CSV

• JSON

• Unstructured

Table of Contents

Revisions

SQL Query Extensions

  Regular Expression

  Fuzzy Hamming Search

  Fuzzy Edit Distance Search

  Compare Fuzzy Hamming to Fuzzy Edit Distance

  Point in Polygon Search

  Any Field Search

SQL Query Extensions Syntax

  Regular Expression SQL Syntax

  Fuzzy Search SQL Syntax

  Search Surrounding Width parameter

  Case Insensitive "Where" Clause

  PIP SQL Syntax

PIP SQL Query Example

PCAP SQL Query Example

Regular Expression on Logs Example

Edit Distance Search Example

Sample SQL queries

  Raw pcap data set 16GB

  Other datasets

    XML Dataset

    Fuzzy search

    Log files 21GB

    Unstructured text proximity search with regular expression 22GB

    Leading wildcard search and regular expression examples on 65GB

SQL Query Extensions

The Connector includes extensions that provide the ability to execute certain functions that are not in the SQL-92 standard. These extensions do not require any code modifications for an SQL application. The supported extensions are:

• Regular Expression search (PCRE2)

• Fuzzy Hamming search

• Fuzzy Edit Distance search

• Point in Polygon (PIP)

Regular Expression

The regular expression search adheres strictly to the PCRE2 regular expression rules. BlackLynx supports the totality of the PCRE2 specification as of June 5, 2017. PCRE2 is a standards-based regular expression format that is heavily used by the search and analytics community for a variety of important search use cases, including cyber use cases.

Fuzzy Hamming Search

The Fuzzy Hamming search operation works similarly to an exact search except that matches do not have to be exact. Instead, the fuzziness parameter allows the specification of a "close enough" value to indicate how close the input must be to match the search criteria. The match string can be up to 32 bytes in length. A "close enough" match is specified as a Hamming distance.

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. As provided to the fuzzy search operation, the Hamming distance specifies the maximum number of substitutions that are allowed in order to declare a match. In addition, similar to exact search, the surrounding mechanism can aid in downstream analysis of contextual use of the fuzzy match results against unstructured raw text data.
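
The Hamming distance definition above can be illustrated with a minimal Python sketch. This is not the BlackLynx implementation; the function names `hamming_distance` and `fuzzy_hamming_find` are hypothetical, and the sliding-window scan is only an assumption about how a same-length match could be located in raw text.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def fuzzy_hamming_find(text: str, term: str, max_distance: int) -> list:
    """Slide a window of len(term) over text and keep the windows whose
    Hamming distance to term is at most max_distance.

    Returns a list of (offset, matched_text) tuples.
    """
    n = len(term)
    return [
        (i, text[i:i + n])
        for i in range(len(text) - n + 1)
        if hamming_distance(text[i:i + n], term) <= max_distance
    ]


# Illustrative input: one substitution ("c" -> "s") is within distance 1.
hits = fuzzy_hamming_find("Ask Mishelle now", "Michelle", 1)
```

Because every candidate window has the same length as the search term, insertions and deletions can never match under this scheme; only substitutions are tolerated.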

Fuzzy Edit Distance Search

Fuzzy Edit Distance Search performs a search that does not require two strings to be of equal length to obtain a match. Instead of considering individual symbol differences, fuzzy edit distance search counts the minimum number of insertions, deletions and replacements required to transform one string into another. This can make it much more powerful than Fuzzy Hamming search for certain applications.
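
As a rough illustration of the counting rule described above, the following Python sketch computes the classic Levenshtein edit distance with dynamic programming. It compares two whole strings, whereas the product searches for matches inside larger data; the function name `edit_distance` is hypothetical and nothing here is taken from the BlackLynx implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements
    needed to transform string a into string b (Levenshtein distance)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # transforming a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


# Distances from the search term "Michelle" to a few spelling variants.
variants = ["Michelle", "Mishelle", "Mischelle", "Michele", "Mischele"]
distances = {v: edit_distance("Michelle", v) for v in variants}
```

A candidate counts as a match when its distance is less than or equal to the distance given in the query.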

Compare Fuzzy Hamming to Fuzzy Edit Distance

Let's conduct a search for the string "Michelle" to compare Fuzzy Hamming with Fuzzy Edit Distance, using a distance of 1 for both.

| Search String | Fuzzy Hamming | Fuzzy Edit Distance, Edit Distance = 1 |
| --- | --- | --- |
| Michelle | Yes. Exact match. | Yes. Exact match. |
| Mishelle | Yes. "c" changed to "s". | Yes. "c" changed to "s". |
| Mischelle | No. The string "chelle" does not appear in the same position as in the original search term "Michelle", so the Hamming distance is greater than the specified distance. | Yes. A single character, "s", is inserted. |
| Michele | No. Deleting one "l" shortens the string by one character, and the match string must be of equal length. | Yes. One "l" deleted. |
| Mischele | No. Although of equal length, more than one change is required to match. | No. Requires 2 changes: add an "s" and remove an "l". NOTE: If the edit distance = 2, then it would match, since the calculated edit distance between the two strings (2) is less than or equal to the desired edit distance (2). |

Fuzzy edit distance is an extremely powerful search tool for a variety of data sources, including names, addresses, medical records searching, genomic and disease research data, common misspellings, and more. Unlike Fuzzy Hamming search, Fuzzy Edit Distance is a more natural fuzzy search paradigm for many algorithms, since it does not require string matches to be of the same size.

Point in Polygon Search

The PIP Search operation can be used to isolate data by longitude and latitude, comparing positions against arbitrary complex polygons of arbitrary numbers of vertices. These searches require the input data to be CSV, XML or JSON formatted.

Since a record is either inside or outside of a given complex polygon, only CONTAINS and NOT_CONTAINS are supported relational operators. The SQL LIKE will translate to CONTAINS in the BlackLynx query and SQL NOT LIKE will translate to NOT_CONTAINS.

By default, the primitive uses an exclusive construct, meaning that results contain points that are fully inside the described complex polygon. An option (INCLUSIVE) exists if it is desired that points on the polygon boundary itself are also to be considered inside the polygon.

To define the polygon used for the query, a VERTEX_LIST or VERTEX_FILE is used. By default, VERTEX_LIST describes a polygon using a format of:

long,lat;long,lat;long,lat;...

Points that define a polygon cannot be listed in any arbitrary order. The requirement is that adjacent points must define an edge, including the wrap from the last point back to the first. The following example would describe a compliant bounding box with four vertices mapping to a portion of the Washington, DC metro area:

VERTEX_LIST="-77.305425,38.789232;-76.823540,38.789232;-76.823540,39.037929;-77.305424,39.037929"
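
To make the vertex ordering and the exclusive/inclusive boundary behavior concrete, here is a minimal Python sketch. It uses the standard ray-casting test on planar coordinates, which is an assumption chosen for illustration and not necessarily the algorithm the PIP primitive uses. The function names and the `inclusive` flag are hypothetical stand-ins; only the VERTEX_LIST string comes from the example above.

```python
def parse_vertex_list(vertex_list: str) -> list:
    """Parse 'long,lat;long,lat;...' into a list of (long, lat) tuples."""
    points = []
    for pair in vertex_list.split(";"):
        lon, lat = pair.split(",")
        points.append((float(lon), float(lat)))
    return points


def on_segment(p, a, b, eps=1e-12) -> bool:
    """True if point p lies on the edge running from a to b."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if abs(cross) > eps:
        return False
    return (min(a[0], b[0]) - eps <= p[0] <= max(a[0], b[0]) + eps
            and min(a[1], b[1]) - eps <= p[1] <= max(a[1], b[1]) + eps)


def point_in_polygon(point, polygon, inclusive=False) -> bool:
    """Ray-casting test. Adjacent vertices form edges, and the last
    vertex wraps back to the first."""
    n = len(polygon)
    # Boundary points count as inside only when inclusive is requested.
    for i in range(n):
        if on_segment(point, polygon[i], polygon[(i + 1) % n]):
            return inclusive
    x, y = point
    inside = False
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


dc_box = parse_vertex_list(
    "-77.305425,38.789232;-76.823540,38.789232;"
    "-76.823540,39.037929;-77.305424,39.037929"
)
downtown = (-77.0365, 38.8977)   # illustrative point inside the box
on_edge = (-76.823540, 38.9)     # illustrative point on the eastern edge
```

Listing the same four vertices in a crossing order (for example swapping the second and third) would describe a self-intersecting shape, which is why the ordering requirement matters.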

VERTEX_FILE provides an alternate mechanism for describing a complex polygon, using a specified input text file which contains one point per line, longitude followed by latitude (on the same line), separated and/or surrounded by one or more whitespace characters. The following example uses the same polygon vertices shown above with VERTEX_LIST, but instead specified by a file "my_polygon_points.txt":

VERTEX_FILE="/path/to/my_polygon_points.txt"

The contents of the file might be:

$ cat /path/to/my_polygon_points.txt

-77.305425 38.789232

-76.823540 38.789232

-76.823540 39.037929

-77.305424 39.037929

Note that VERTEX_FILE can be very useful for very large polygons with many hundreds of vertices, such as those that might describe state boundaries, voting districts, international boundaries, oil and gas exploration boundaries, or other arbitrary areas of interest describable by sets of vertices defined by "longitude, latitude" pairs.

Polygons can be grouped in trivial fashion. This is accomplished by setting the option VERTEX_FILE_IS_FILELIST to true, in which case the VERTEX_FILE specifies a filename that is a list of files, one per line, which each describe individual polygons with the same conventions noted above. An example might resemble:

VERTEX_FILE="/path/to/my_list_of_polygon_files.txt", VERTEX_FILE_IS_FILELIST="true"
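
The two file layouts described above can be sketched in Python as follows. This is only an illustration of the file conventions, not connector code: the function names are hypothetical, the files are created in a temporary directory, and the sketch assumes each path in the file list can be opened as written.

```python
import tempfile
from pathlib import Path


def read_vertex_file(path) -> list:
    """Read one point per line: longitude then latitude, separated and/or
    surrounded by whitespace. Returns a list of (long, lat) tuples."""
    points = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        lon, lat = fields
        points.append((float(lon), float(lat)))
    return points


def read_vertex_filelist(path) -> list:
    """Read a file that lists vertex files, one per line, and return
    one polygon (list of points) per listed file."""
    polygons = []
    for line in Path(path).read_text().splitlines():
        name = line.strip()
        if name:
            polygons.append(read_vertex_file(name))
    return polygons


# Build throwaway files that mirror the examples in the text.
with tempfile.TemporaryDirectory() as tmp:
    vertex_path = Path(tmp) / "my_polygon_points.txt"
    vertex_path.write_text(
        "-77.305425 38.789232\n"
        "-76.823540 38.789232\n"
        "  -76.823540   39.037929  \n"
        "-77.305424 39.037929\n"
    )
    list_path = Path(tmp) / "my_list_of_polygon_files.txt"
    list_path.write_text(f"{vertex_path}\n")

    polygon = read_vertex_file(vertex_path)
    polygons = read_vertex_filelist(list_path)
```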

Any Field Search

The ANY field search allows the SQL query to execute a raw text search on all fields in a record structure, regardless of which field is specified by the query. This is particularly useful when the user does not know which field in the record may contain the data. The format of the expression is:

select ... where <any field> like '-a<x>(<expression>)'

Where:

• <any field> is any valid field in the record

• - is the BlackLynx indicator of a SQL extension

• a|A indicates that the operation is on all fields of the structure

• <x> is the extension to apply. Valid values are r|R, e|E, or h|H

• <expression> is the expression to search for

NOTE: This type of query extension can only be used when the structured files have line-based records, that is, one record per line in the structured file. XML files rarely have one record per line, and PCAP does not store one frame per line, so this extension is not applicable in those cases.
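
The effect of searching all fields of a line-based record can be approximated in Python by applying the expression to the whole record line instead of a single column. This is a conceptual sketch only: the helper name `any_field_regex` and the sample CSV records are invented for illustration, and Python's `re` module stands in for PCRE2.

```python
import re


def any_field_regex(lines, pattern):
    """Return every record line in which the pattern occurs anywhere,
    regardless of which field contains it."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]


# Hypothetical one-record-per-line CSV data with two MAC address columns.
records = [
    "id-1,00:11:22:33:44:55,68:72:51:5e:7d:17",
    "id-2,6872515e7d17,aa:bb:cc:dd:ee:ff",
    "id-3,00:11:22:33:44:55,aa:bb:cc:dd:ee:ff",
]

# The optional colons let the pattern match either MAC notation.
matches = any_field_regex(records, r"68:?72:?51:?5e:?7d:?17")
```

The first record matches in its third field and the second record matches in its second field, which is the behavior the extension provides when the user cannot say in advance which column holds the value.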

SQL Query Extensions Syntax

Regular Expression SQL Syntax

The SQL syntax is modified in the "where" clause match statement as follows:

select <xxx> from <table> where <column> like '-r(<pcre2 expression>)'

Where:

• -r or -R denotes regular expression search.

Fuzzy Search SQL Syntax

The SQL syntax is modified in the "where" clause match statement as follows:

select <xxx> from <table> where <column> like '-<fuzzy_type><distance>(<term_to_match>)'

Where:

• <fuzzy_type> = h|H denoting Hamming search or e|E denoting edit distance search.

• <distance> = integer, max = 255. This is the fuzziness of the search. For fuzzy Hamming search, fuzziness is measured as the maximum Hamming distance allowed in order to declare a match. For fuzzy edit distance search, fuzziness is measured as the maximum number of insertions, deletions or replacements allowed in order to declare a match.

Example: The following command executes a fuzzy edit distance search (distance=2)

select * from Passengers where Name like '-E2(Michelle)'
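
When queries are generated by an application, the pattern string can be assembled and range-checked before it is sent. The helper below is a hypothetical Python sketch, not part of the connector. It enforces the 255 maximum and the 32-byte Hamming match-string limit stated in this guide; treating 0 as the smallest allowed distance is an assumption.

```python
def fuzzy_like_pattern(fuzzy_type: str, distance: int, term: str) -> str:
    """Build the LIKE operand '-<fuzzy_type><distance>(<term_to_match>)'."""
    if fuzzy_type not in ("h", "H", "e", "E"):
        raise ValueError("fuzzy_type must be h|H (Hamming) or e|E (edit distance)")
    if not 0 <= distance <= 255:  # lower bound of 0 is an assumption
        raise ValueError("distance must be between 0 and 255")
    if fuzzy_type in ("h", "H") and len(term.encode("utf-8")) > 32:
        raise ValueError("Hamming match string can be at most 32 bytes")
    return f"-{fuzzy_type}{distance}({term})"


pattern = fuzzy_like_pattern("E", 2, "Michelle")
query = f"select * from Passengers where Name like '{pattern}'"
```

The sketch does not escape quotes inside the search term, so a term containing a single quote would need additional handling before being embedded in SQL text.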

Search Surrounding Width parameter

The surrounding Width parameter enables you to specify the number of bytes (characters), up to a maximum of 262,144, that will be returned before and after the match when using text search. NOTE: Width is only used for queries on unstructured files and is useful in providing context to the match.

The syntax is modified in the "where" clause match statement as follows:

select ... where <column> like '-<fuzzy_type><distance>(<term_to_match>)-W<width>'

Where:

• <width> = integer denoting the number of bytes (characters) returned before and after the match.

Example: The following command executes a regular expression search with a surrounding width of 20.

select * from wikipedia where Results like '-r(beautiful (\w+ ){0,5}world)-W20'

The result includes 20 bytes/characters before and after the regular expression match.
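
The effect of the width parameter can be mimicked locally with Python's `re` module, which accepts this particular expression unchanged. The sketch below is an approximation for experimenting with a pattern before running it on the server: the helper name `search_with_width` and the sample sentence are made up, and it counts characters rather than bytes, which is equivalent only for single-byte text.

```python
import re


def search_with_width(text: str, pattern: str, width: int) -> list:
    """Return each regex match together with up to `width` characters
    of context before and after it."""
    results = []
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        results.append(text[start:end])
    return results


sample = ("It is often said that we live in a beautiful and very "
          "strange world, full of surprises.")

# "beautiful", then zero to five intervening words, then "world".
hits = search_with_width(sample, r"beautiful (\w+ ){0,5}world", 20)
```

Near the start or end of the data the returned context is simply shorter than the requested width.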

Case Insensitive "Where" Clause

The Connector provides the ability to execute a query that is case-insensitive or case-sensitive. By default, all queries are case-sensitive. The case-insensitive selection is made by using the "-i" parameter in the "where" clause.

Example: The following matches the name “Michelle” in any combination of upper or lower case.

select * from Passengers where Name like '(Michelle)-i'

PIP SQL Syntax

The SQL where clause for a PIP query is specified as:

select ... WHERE <combined location|latitude|longitude> like '-p(VERTEX_FILE="<file path to VERTEX file>"[,options])'

Where:

• -p or -P denotes PIP search.

• <file path to VERTEX file> is the path to a file that contains the polygon vertices, or to a file that lists files containing polygon vertices. The vertex file path is appended to the end of VERTEX_FILE_PATH in the ".ryftone.server.ini" file. If VERTEX_FILE_PATH is not specified in the ".ryftone.server.ini" file, then the full path must be given.

Options are a comma-separated list of named values that either describe the vertex file or modify the PIP operation:

• VERTEX_FILE_IS_FILELIST="true|false" where the default is "false". It only needs to be specified when VERTEX_FILE denotes a file that lists other vertex files, in which case it must be set to "true".

• INCLUSIVE="true|false" where the default is "false". Describes how the software will treat points that fall exactly on the polygon boundary.

• FORMAT_POLYGON="LONG_LAT|LAT_LONG" where the default is "LONG_LAT". This must be set to "LAT_LONG" if the vertex files contain latitude followed by longitude.

The .meta.table must contain a parameter, PIP_FORMAT, which is used by the PIP primitive. This parameter is a string that defines the location and formatting of the latitude and longitude data in the structured records. There are two types of geodata: split fields and a combined field.

Split fields:

'LAT_COORD="<field>", LONG_COORD="<field>"' where <field> is either the column number (CSV) or the column name (JSON or XML) of the fields holding the latitude and longitude values in the table.

Combined field:

'FORMAT_DATA="LONG_LAT"' for tables with combined lat/long fields where the longitude is the first number, or

'FORMAT_DATA="LAT_LONG"' for tables with combined lat/long fields where the latitude is the first number.

PIP SQL Query Example

Here are several examples of custom SQL PIP queries. The first example uses the Chicago Crime dataset, which is in CSV format. The query filters the data on Primary_Type and then checks the results against two vertex files. The PIP search matches all points that fall inside the outer polygon but not inside the inner polygon, that is, within the donut-like boundary specified.

select Primary_Type, Block, Latitude, Longitude, Location from Chicago_Crime_CSV where Primary_Type like 'ASSAULT' and Location like '-p(VERTEX_FILE="/ryftone/miscTestFiles/polygons/chicago/chi-outer.vf")' and Location not like '-p(VERTEX_FILE="/ryftone/miscTestFiles/polygons/chicago/chi-inner.vf")'

The following example shows a query that uses the "Any" field extension with a regular expression search and combines the results with a PIP search. Note that the search found the match in the s_mac field, even though the request specified the i_mac field.

select adid_value, i_mac, s_mac, s_lat, s_lon, freshness from sfdata where i_mac like '-ar(68:?72:?51:?5e:?7d:?17)' and s_lat like '-p(VERTEX_FILE="/ryftone/miscTestFiles/polygons/nigeria/nigeria-lagos.txt", VERTEX_FILE_IS_FILELIST="true")'

PCAP SQL Query Example

Here is an example of a custom Tableau dashboard built on a PCAP (packet capture binary) dataset. The query is done in a single step, with no indexing and no merging of separate queries, as would typically be required with an SQL table.

The dashboard executed the following queries on the PCAP binary dataset.

▪ select ip_dst, http_req_uri, count(ip_dst) as IP_Dest_Count from PCAP where http_req_method = 'GET' and http_req_version = 'HTTP/1.1' group by ip_dst, http_req_uri

▪ select frame_time, count(*) as N from PCAP where http_req_method = 'GET' group by frame_time

Regular Expression on Logs Example

This is an example of searching the log_data files for matches containing IP addresses outside the 172.31 domain, returning up to 20 characters of surrounding text before and after each match for context.

select * from log_data where Results like '-r(host:..(?!ip-172-31-.*)(ip[-0-9]*).)-w20'
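
The negative lookahead in this expression can be tried out locally. The Python sketch below applies the same pattern with the `re` module, which supports this lookahead syntax; the two sample log lines are invented for illustration, since the actual layout of the log_data files is not shown in this guide.

```python
import re

# Same expression as in the query: "host:", any two characters, then a
# host name starting with "ip" that does NOT begin with "ip-172-31-".
LOG_PATTERN = re.compile(r"host:..(?!ip-172-31-.*)(ip[-0-9]*).")

# Hypothetical log lines; the real log format may differ.
sample_lines = [
    'statistic host: "ip-10-0-3-17" cpu=12',
    'statistic host: "ip-172-31-5-9" cpu=40',
]

external_hosts = [
    match.group(1)
    for line in sample_lines
    for match in LOG_PATTERN.finditer(line)
]
```

Only the first line yields a match, because the lookahead rejects any host name on the 172.31 subnet before the capturing group is attempted.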

Edit Distance Search Example

SQL Statement: select * from Passengers where Name like '-E2(Michelle)'

Sample SQL queries

Below is a list of sample SQL Queries to try out with the sample data on the BlackLynx server.

Raw pcap data set 16GB

select ip_src, ip_dst, http_req_uri from PCAP where ip_dst = '34.238.50.30' and http_req_method = 'GET'

select * from PCAP where ip_dst = '34.238.50.30' and payload like '-r(561-69-\d{4})'

Other datasets

There are other datasets available which can be installed on the server for demonstration purposes. A sample list is outlined below.

XML Dataset

Search using wildcards where the match can occur anywhere in the column data.

select * from Chicago_Crime where Description LIKE '%ELECTRONIC%' and Block LIKE '%INDIANA%'

Fuzzy search

Levenshtein (edit distance) fuzzy search with a distance of 2.

select * from Passengers where Name like '-E2(Michelle Jones)'

Log files 21GB

Example of a regular expression search with a negative lookahead assertion. The expression matches the string 'host:' and two characters, followed by an AWS private DNS address, excluding any addresses on our subnet ('ip-172-31-').

select * from logs where Results like '-r(host:..(?!ip-172-31-.*)(ip[-0-9]*).)-w20'

Unstructured text proximity search with regular expression 22GB

Searching an unstructured file with a regular expression proximity search. In this case, searching for the word "beautiful" followed by "world" with 0 to 5 words in between.

select * from wikipedia where Results like '-r(beautiful (\w+ ){0,5}world)-w20'

Leading wildcard search and regular expression examples on 65GB

select * from CDR where msisdn like '%%%%%%951354'

select * from CDR where msisdn like '-r((886682|8866271)\d{6})'
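
To see what these two predicates select, the Python sketch below runs the leading-wildcard LIKE against an in-memory SQLite table and evaluates the regular expression with the `re` module. SQLite and `re` are stand-ins chosen for a self-contained illustration, not the BlackLynx engine, and the four msisdn values are invented.

```python
import re
import sqlite3

# Hypothetical subscriber numbers.
msisdns = ["886682123456", "8866271123456", "886123951354", "441234567890"]

# Standard SQL LIKE: consecutive % wildcards behave like a single %,
# so the pattern selects values that end in 951354.
conn = sqlite3.connect(":memory:")
conn.execute("create table CDR (msisdn text)")
conn.executemany("insert into CDR values (?)", [(m,) for m in msisdns])
wildcard_hits = [
    row[0]
    for row in conn.execute(
        "select msisdn from CDR where msisdn like '%%%%%%951354' order by msisdn"
    )
]
conn.close()

# The regular expression selects values containing either prefix
# followed by exactly six more digits.
prefix_pattern = re.compile(r"(886682|8866271)\d{6}")
regex_hits = [m for m in msisdns if prefix_pattern.search(m)]
```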