BLACKLYNX SQL GETTING STARTED GUIDE
Using BlackLynx SQL Extensions
Version 1.3
ABSTRACT
Use the BlackLynx ODBC/JDBC Connector to interface your data to your business analytics, business intelligence, or business visualization applications without the need for indexing or ETL.
June 21, 2019
BlackLynx ODBC/JDBC SQL Extensions
BlackLynx SQL Getting Started Guide BlackLynx, Inc. ©2018 Page | 1
Revisions:
Date | Reason for Change | ODBC Version
April 2019 | Update for compatibility with BlackLynx RESTful server revision 1.3.0; add sections on PCAP and PIP primitives; add “any” field search extension | 2.7.0.2
March 2019 | Added CentOS 7.6 support; added field_delimiter parameter for CSV type files | 2.6.45.4
December 2018 | Added PCAP support | 2.6.45.1
Use the BlackLynx ODBC/JDBC Connector to interface your data to your business analytics, business intelligence,
or business visualization applications without the need for indexing or ETL (extract, transform and load). The
connector will interface with data in the following formats:
• PCAP (protocol capture binary)
• XML
• CSV
• JSON
• Unstructured
Table of Contents
Revisions
SQL Query Extensions
    Regular Expression
    Fuzzy Hamming Search
    Fuzzy Edit Distance Search
    Compare Fuzzy Hamming to Fuzzy Edit Distance
    Point in Polygon Search
    Any Field Search
SQL Query Extensions Syntax
    Regular Expression SQL Syntax
    Fuzzy Search SQL Syntax
    Search Surrounding Width parameter
    Case Insensitive “Where” Clause
    PIP SQL Syntax
PIP SQL Query Example
PCAP SQL Query Example
Regular Expression on Logs Example
Edit Distance Search Example
Sample SQL queries
    Raw pcap data set 16GB
    Other datasets
        XML Dataset
        Fuzzy search
        Log files 21GB
        Unstructured text proximity search with regular expression 22GB
        Leading wildcard search and regular expression examples on 65GB
SQL Query Extensions
The Connector includes extensions that provide the ability to execute certain functions that are not in the
SQL-92 standard. These extensions do not require any code modifications for an SQL application. The supported
extensions are:
• Regular Expression search (PCRE2)
• Fuzzy Hamming search
• Fuzzy Edit Distance search
• Point in Polygon (PIP)
Regular Expression
The regular expression search adheres strictly to the PCRE2 regular expression rules. BlackLynx supports the
full PCRE2 specification as of June 5, 2017. PCRE2 is a standards-based regular
expression format that is heavily used by the search and analytics community for a variety of search
use cases, including cyber use cases.
Fuzzy Hamming Search
The Fuzzy Hamming search operation works similarly to an exact search except that matches do not have to be
exact. Instead, the fuzziness parameter specifies a "close enough" value that indicates how close
the input must be to match the search criteria. The match string can be up to 32 bytes in length. A "close
enough" match is specified as a Hamming distance.
The Hamming distance between two strings of equal length is the number of positions at which the
corresponding symbols are different. As provided to the fuzzy search operation, the Hamming distance specifies
the maximum number of substitutions that are allowed in order to declare a match. In addition, as with exact
search, the surrounding width mechanism can aid downstream analysis of the context in which fuzzy matches occur
in unstructured raw text data.
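The Hamming matching rule described above can be sketched in a few lines of Python. This is a minimal illustration of the technique, not the BlackLynx implementation: the function names `hamming_distance` and `fuzzy_hamming_find` are hypothetical, and the sketch assumes the search slides a window of the term's length across the text.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def fuzzy_hamming_find(text, term, max_distance):
    """Return (offset, window) pairs within max_distance substitutions of term.

    Assumption: the search compares the term against every window of the
    same length in the text. The 32-byte limit comes from the guide.
    """
    if len(term.encode("utf-8")) > 32:
        raise ValueError("match string is limited to 32 bytes")
    n = len(term)
    hits = []
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        if hamming_distance(window, term) <= max_distance:
            hits.append((start, window))
    return hits


# "Mishelle" differs from "Michelle" by one substitution, so it matches.
print(fuzzy_hamming_find("passenger Mishelle boarded", "Michelle", 1))
# "Michele" is one character shorter, so no window of length 8 matches.
print(fuzzy_hamming_find("Michele", "Michelle", 1))
```

Because only substitutions are counted, a candidate that gains or loses a character shifts every later position and falls outside the allowed distance.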
Fuzzy Edit Distance Search
Fuzzy Edit Distance Search performs a search that does not require two strings to be of equal length to obtain a
match. Instead of considering individual symbol differences, fuzzy edit distance search counts the minimum
number of insertions, deletions and replacements required to transform one string into another. This can make
it much more powerful than Fuzzy Hamming search for certain applications.
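The edit distance defined above is commonly computed with the Levenshtein dynamic-programming recurrence. The sketch below is an illustration under that assumption, not the connector's implementation; `edit_distance` is a hypothetical helper that compares two whole strings, whereas the search itself looks for matches inside larger data.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and replacements
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca with cb, or keep it
            ))
        previous = current
    return previous[-1]


# Candidates that differ from the search term "Michelle" in various ways.
for candidate in ["Michelle", "Mishelle", "Mischelle", "Michele", "Mischele"]:
    distance = edit_distance("Michelle", candidate)
    print(candidate, distance, "match" if distance <= 1 else "no match")
```

With a maximum distance of 1, every candidate except "Mischele" matches; "Mischele" needs one insertion and one deletion, so it matches only when the allowed distance is 2 or more.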
Compare Fuzzy Hamming to Fuzzy Edit Distance
The following table compares Fuzzy Hamming with Fuzzy Edit Distance for a search on the string “Michelle”, using a distance of 1.
Search String | Fuzzy Hamming | Fuzzy Edit Distance, Edit Distance = 1
Michelle | Yes. Exact match. | Yes. Exact match.
Mishelle | Yes. “c” changed to “s”. | Yes. “c” changed to “s”.
Mischelle | No. The string “chelle” does not appear in the same position as in the original search term “Michelle”. This makes it an edit distance of 6, which is greater than the specified distance. | Yes. Requires inserting the single character “s”.
Michele | No. Deleting one “l” shortens the string by one character, and the match string must be of equal length. | Yes. One “l” deleted.
Mischele | No. Although of equal length, more than one change is required to match. | No. Requires 2 changes: add an “s” and remove an “l”. NOTE: If the edit distance = 2, then it would match since the calculated edit distance between the 2 strings is less than or equal to the desired edit distance (2).
Fuzzy edit distance is an extremely powerful search tool for a variety of data sources, including names,
addresses, medical records searching, genomic and disease research data, common misspellings, and more.
Unlike Fuzzy Hamming search, Fuzzy Edit Distance is a more natural fuzzy search paradigm for many algorithms,
since it does not require string matches to be of the same size.
Point in Polygon Search
The PIP Search operation can be used to isolate data by longitude and latitude, comparing positions against
arbitrary complex polygons of arbitrary numbers of vertices. These searches require the input data to be CSV,
XML or JSON formatted.
Since a record is either inside or outside of a given complex polygon, only CONTAINS and NOT_CONTAINS are
supported relational operators. The SQL LIKE will translate to CONTAINS in the BlackLynx query and SQL NOT
LIKE will translate to NOT_CONTAINS.
By default, the primitive uses an exclusive construct, meaning that results contain points that are fully inside the
described complex polygon. An option (INCLUSIVE) exists if it is desired that points on the polygon boundary
itself are also to be considered inside the polygon.
To define the polygon used for the query, specify either a VERTEX_LIST or a VERTEX_FILE. By default, VERTEX_LIST describes a polygon using the format:
long,lat;long,lat;long,lat;...
The points that define a polygon cannot be listed in arbitrary order. Adjacent points must define an edge, including the wrap from the last point in the list back to the first. The following example describes a compliant bounding box with four vertices covering a portion of the Washington, DC metro area:
VERTEX_LIST="-77.305425,38.789232;-76.823540,38.789232;-76.823540,39.037929;-77.305424,39.037929"
VERTEX_FILE provides an alternate mechanism for describing a complex polygon, using a specified input text file which contains one point per line, longitude followed by latitude (on the same line), separated and/or surrounded by one or more whitespace characters. The following example uses the same polygon vertices shown above with VERTEX_LIST, but specified by a file “my_polygon_points.txt”:
VERTEX_FILE="/path/to/my_polygon_points.txt"
The contents of the file might be:
$ cat /path/to/my_polygon_points.txt
-77.305425 38.789232
-76.823540 38.789232
-76.823540 39.037929
-77.305424 39.037929
Note that VERTEX_FILE can be very useful for very large polygons with many hundreds of vertices, such as
those that might describe state boundaries, voting districts, international boundaries, oil and gas exploration
boundaries, or other arbitrary areas of interest describable by sets of vertices defined by “longitude, latitude”
pairs.
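To make the polygon conventions concrete, the following Python sketch parses a vertex string in the default long,lat;long,lat;... layout and tests points with the classic ray-casting method. It is an illustration only: the guide does not state which algorithm the PIP primitive uses, and the names `parse_vertex_list`, `on_segment` and `point_in_polygon` are hypothetical. The `inclusive` flag mirrors the exclusive default and the INCLUSIVE option.

```python
def parse_vertex_list(vertex_list):
    """Parse 'long,lat;long,lat;...' into a list of (long, lat) floats."""
    points = []
    for pair in vertex_list.split(";"):
        lon, lat = pair.split(",")
        points.append((float(lon), float(lat)))
    return points


def on_segment(p, a, b, eps=1e-12):
    """True if point p lies on the edge running from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    if abs(cross) > eps:
        return False  # p is not on the line through a and b
    return (min(ax, bx) - eps <= px <= max(ax, bx) + eps
            and min(ay, by) - eps <= py <= max(ay, by) + eps)


def point_in_polygon(point, polygon, inclusive=False):
    """Ray-casting test; boundary points count as inside only if inclusive."""
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        a, b = polygon[i], polygon[(i + 1) % n]  # wraps last vertex to first
        if on_segment(point, a, b):
            return inclusive
        (ax, ay), (bx, by) = a, b
        if (ay > py) != (by > py):  # edge crosses the horizontal ray's level
            x_cross = ax + (py - ay) * (bx - ax) / (by - ay)
            if px < x_cross:
                inside = not inside
    return inside


dc_box = parse_vertex_list(
    "-77.305425,38.789232;-76.823540,38.789232;"
    "-76.823540,39.037929;-77.305424,39.037929"
)
print(point_in_polygon((-77.0365, 38.8977), dc_box))         # inside the box
print(point_in_polygon((-76.6122, 39.2904), dc_box))         # outside the box
print(point_in_polygon((-77.0, 38.789232), dc_box, True))    # on the boundary
```

The `(i + 1) % n` index is what makes vertex order matter: each vertex is joined to the next one in the list, and the last is joined back to the first.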
Polygons can be grouped in trivial fashion. This is accomplished by setting the option VERTEX_FILE_IS_FILELIST to true, in which case VERTEX_FILE specifies a file that contains a list of files, one per line, each of which describes an individual polygon with the same conventions noted above. An example might resemble:
VERTEX_FILE="/path/to/my_list_of_polygon_files.txt", VERTEX_FILE_IS_FILELIST="true"
Any Field Search
The ANY field search allows the SQL query to execute a raw text search on all fields in a record structure
regardless of which field is specified by the query. This is particularly useful when the user does not know which field
in the record may contain the data. The format of the expression is:
select ... where <any field> like '-a<x>(<expression>)'
Where:
• <any field> is any valid field in the record.
• - is the BlackLynx indicator of a SQL extension.
• a|A indicates that the operation is on all fields of the structure.
• <x> is the extension to apply. Valid values are r|R, e|E, or h|H.
• <expression> is the expression to search for.
NOTE: This type of query extension can only be used when the structured files have line-based records, that is,
one record per line in the structured file. XML files rarely have one record per line and PCAP is
not a frame per line, so this extension is not applicable in those cases.
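The idea of searching the raw text of a line-based record, rather than a single named column, can be sketched in Python as follows. This is an illustration under assumptions, not the connector's behavior: `any_field_search` is a hypothetical helper, the CSV records are invented sample data, and Python's `re` module stands in for PCRE2.

```python
import csv
import re


def any_field_search(lines, pattern, field_names):
    """Return (values, matching_field_names) for every line the regex hits.

    The raw line is searched first, which is why the technique needs one
    record per line; the record is then split to report where the hit was.
    """
    regex = re.compile(pattern)
    hits = []
    for line in lines:
        if not regex.search(line):  # raw text search over the whole record
            continue
        values = next(csv.reader([line]))
        matched = [name for name, value in zip(field_names, values)
                   if regex.search(value)]
        hits.append((values, matched))
    return hits


# Hypothetical records with the columns adid_value, i_mac, s_mac.
records = [
    "a-001,00:11:22:33:44:55,68:72:51:5e:7d:17",
    "a-002,aa:bb:cc:dd:ee:ff,0a:0b:0c:0d:0e:0f",
]
hits = any_field_search(records, r"68:?72:?51:?5e:?7d:?17",
                        ["adid_value", "i_mac", "s_mac"])
print(hits)
```

In this sample the only hit is in the s_mac column, even though a caller might have named i_mac in the query.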
SQL Query Extensions Syntax
Regular Expression SQL Syntax
The SQL syntax is modified in the “where” clause match statement as follows:
select <xxx> from <table> where <column> like '-r(<pcre2 expression>)'
Where:
• -r or -R denotes regular expression search.
Fuzzy Search SQL Syntax
The SQL syntax is modified in the “where” clause match statement as follows:
select <xxx> from <table> where <column> like '-<fuzzy_type><distance>(<term_to_match>)'
Where:
• <fuzzy_type> = h|H denoting Hamming search or e|E denoting edit distance search.
• <distance> = integer, max = 255. The fuzziness of the search. For Fuzzy Hamming search, fuzziness is the maximum Hamming distance allowed in order to declare a match. For Fuzzy Edit Distance search, fuzziness is the maximum number of insertions, deletions or replacements allowed in order to declare a match.
Example: The following command executes a fuzzy edit distance search (distance=2):
select * from Passengers where Name like '-E2(Michelle)'
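An application that builds these queries programmatically might assemble the pattern as in the Python sketch below. The helper names `fuzzy_like_pattern` and `fuzzy_query` are hypothetical and not part of the connector. Treating 0 as a valid distance and rejecting terms that contain a single quote are assumptions made here to keep the sketch safe and simple.

```python
def fuzzy_like_pattern(term, distance, fuzzy_type="E"):
    """Build '-<fuzzy_type><distance>(<term_to_match>)' for a LIKE clause."""
    if fuzzy_type not in ("h", "H", "e", "E"):
        raise ValueError("fuzzy_type must be h|H (Hamming) or e|E (edit distance)")
    if not 0 <= distance <= 255:  # 255 is the documented maximum
        raise ValueError("distance must be between 0 and 255")
    if "'" in term:
        # Assumption: quoting rules inside the extension are not documented,
        # so this sketch refuses such terms rather than guess at escaping.
        raise ValueError("term must not contain a single quote")
    return "-{}{}({})".format(fuzzy_type, distance, term)


def fuzzy_query(table, column, term, distance, fuzzy_type="E"):
    """Assemble a complete select statement using the fuzzy extension."""
    pattern = fuzzy_like_pattern(term, distance, fuzzy_type)
    return "select * from {} where {} like '{}'".format(table, column, pattern)


print(fuzzy_query("Passengers", "Name", "Michelle", 2))
print(fuzzy_query("Passengers", "Name", "Michelle", 1, fuzzy_type="h"))
```

The first call reproduces the example statement above; the second switches to a Hamming search with a distance of 1.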
Search Surrounding Width parameter
The surrounding width parameter specifies the number of bytes (characters), up to a maximum
of 262,144, returned before and after the match when using text search. NOTE: Width is only used
for queries on unstructured files and is useful in providing context for the match.
The syntax is modified in the “where” clause match statement as follows:
select … where <column> like '-<fuzzy_type><distance>(<term_to_match>)-W<width>'
Where:
• <width> = integer denotes the number of bytes (characters) before and after the match.
Example: The following command executes a regular expression search with a surrounding width of 20.
select * from wikipedia where Results like '-r(beautiful (\w+ ){0,5}world)-W20'
The result includes 20 bytes/characters before and after the regular expression match.
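The effect of the width parameter can be approximated in Python as shown below. This is a sketch under assumptions: Python's `re` module stands in for PCRE2 (this particular pattern is valid in both), the width is counted in characters rather than bytes, and `search_with_surrounding` and the sample sentence are hypothetical.

```python
import re


def search_with_surrounding(text, pattern, width):
    """Return each regex match together with up to `width` characters of
    context on either side, clipped at the start and end of the text."""
    results = []
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        results.append(text[start:end])
    return results


sample = ("They say it is a beautiful and very strange world out there, "
          "full of surprises.")
for hit in search_with_surrounding(sample, r"beautiful (\w+ ){0,5}world", 20):
    print(repr(hit))
```

The group `(\w+ ){0,5}` allows at most five intervening words, so a sentence with six words between "beautiful" and "world" produces no match.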
Case Insensitive “Where” Clause
The Connector provides the ability to execute a query that is case-insensitive or case-sensitive. By default, all
queries are case-sensitive. The case-insensitive selection is made by using the “-i” parameter in the “where”
clause.
Example: The following matches the name “Michelle” in any combination of upper or lower case.
select * from Passengers where Name like '(Michelle)-i'
PIP SQL Syntax
The SQL where clause for a PIP query is specified as:
select ... WHERE <combined location|latitude|longitude> like '-p(VERTEX_FILE="<file path to VERTEX file>"[,options])'
Where:
-p or -P denotes PIP search.
<file path to VERTEX file> is the path to a file that contains the polygon vertices or a list of files that
contain polygon vertices. The vertex file path is appended to the end of VERTEX_FILE_PATH in the
“.ryftone.server.ini” file. If VERTEX_FILE_PATH is not specified in the “.ryftone.server.ini” file, then the full path must be
given.
Options are a comma separated list of named values that either describe the vertex file or modify the
PIP operation:
• VERTEX_FILE_IS_FILELIST="true|false" where the default is “false”. It does not need to be
specified unless VERTEX_FILE denotes a path to a file that lists polygon files, in which case it must be set to “true”.
• INCLUSIVE="true|false" where the default is “false”. Describes how the software will
treat points that fall exactly on the polygon boundary.
• FORMAT_POLYGON="LONG_LAT|LAT_LONG" where the default is “LONG_LAT”. This must be
specified if the vertex files contain latitude data followed by longitude.
The .meta.table must contain a parameter, PIP_FORMAT, which is used by the PIP primitive. This parameter is a
string that defines the location and formatting of the latitude and longitude data in the structured records. There are two types of geodata layout: split fields and a combined field.
Split fields:
'LAT_COORD="<field>", LONG_COORD="<field>"' where <field> represents either the column number (CSV) or column name (JSON or XML) of the fields holding the latitude and longitude values in the table.
Combined field:
'FORMAT_DATA="LONG_LAT"' for tables with combined lat/long fields where the longitude is the first number, or
'FORMAT_DATA="LAT_LONG"' for tables with combined lat/long fields where the latitude is the first number.
PIP SQL Query Example
Here are several examples of custom SQL PIP queries. The first example uses the Chicago Crime dataset,
which is in CSV format. The query searches the data and then tests the results
against two vertex files. The PIP search matches all points contained within the donut-like
boundary specified.
select Primary_Type, Block, Latitude, Longitude, Location from Chicago_Crime_CSV
where Primary_Type like 'ASSAULT'
and Location like '-p(VERTEX_FILE="/ryftone/miscTestFiles/polygons/chicago/chi-outer.vf")'
and Location not like '-p(VERTEX_FILE="/ryftone/miscTestFiles/polygons/chicago/chi-inner.vf")'
The following example shows a query that uses the “Any” field extension with a regular expression search and
combines the results with a PIP search. Note that the search found the match in the s_mac field, even though
the query specified the i_mac field.
select adid_value, i_mac, s_mac, s_lat, s_lon, freshness from sfdata
where i_mac like '-ar(68:?72:?51:?5e:?7d:?17)'
and s_lat like '-p(VERTEX_FILE="/ryftone/miscTestFiles/polygons/nigeria/nigeria-lagos.txt", VERTEX_FILE_IS_FILELIST="true")'
PCAP SQL Query Example
Here is an example of a custom Tableau dashboard with a PCAP (protocol capture binary) dataset. The query is
done in a single step, with no indexing and no merging of separate queries, as would typically be done with an SQL table.
The dashboard executed the following queries on the PCAP binary dataset.
▪ select ip_dst, http_req_uri, count(ip_dst) as IP_Dest_Count from PCAP where http_req_method = 'GET' and http_req_version = 'HTTP/1.1' group by ip_dst, http_req_uri
▪ select frame_time, count(*) as N from PCAP where http_req_method = 'GET' group by frame_time
Regular Expression on Logs Example
This example searches the log_data files for matches containing IP addresses outside the 172.31
domain and provides up to 20 characters of surrounding text before and after the match.
select * from log_data where Results like '-r(host:..(?!ip-172-31-.*)(ip[-0-9]*).)-w20'
Edit Distance Search Example
SQL Statement: select * from Passengers where Name like '-E2(Michelle)'
Sample SQL queries
Below is a list of sample SQL Queries to try out with the sample data on the BlackLynx server.
Raw pcap data set 16GB
select ip_src, ip_dst, http_req_uri from PCAP where ip_dst = '34.238.50.30' and http_req_method = 'GET'
select * from PCAP where ip_dst = '34.238.50.30' and payload like '-r(561-69-\d{4})'
Other datasets
There are other datasets available which can be installed on the server for demonstration purposes. A sample
list is outlined below.
XML Dataset
Search using wildcards where the match can occur anywhere in the column data.
select * from Chicago_Crime where Description LIKE '%ELECTRONIC%' and Block LIKE '%INDIANA%'
Fuzzy search
Levenshtein fuzzy search with a distance of 2.
select * from Passengers where Name like '-E2(Michelle Jones)'
Log files 21GB
Example of a negative assertion regular expression search. The word 'statistic' followed by any number of
characters, the string 'host:' and two characters, an AWS private DNS address, excluding any addresses on our
subnet ('ip-172-31-').
select * from logs where Results like '-r(host:..(?!ip-172-31-.*)(ip[-0-9]*).)-w20'
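The negative lookahead in this pattern can be tried out locally with Python's `re` module, which accepts the same syntax for this expression. This is only a sketch: the two log lines are invented sample data, since the layout of the real log files is not shown in this guide, and the name `PATTERN` is hypothetical.

```python
import re

# Same expression as in the query above, without the BlackLynx wrapper.
PATTERN = re.compile(r"host:..(?!ip-172-31-.*)(ip[-0-9]*).")

# Hypothetical log lines: one host outside the subnet and one inside it.
log_lines = [
    'statistic cpu host: "ip-10-0-3-17" load=0.4',
    'statistic cpu host: "ip-172-31-8-2" load=0.9',
]

for line in log_lines:
    match = PATTERN.search(line)
    # group(1) holds the captured host name when the line matches
    print(match.group(1) if match else None)
```

The `(?!ip-172-31-.*)` assertion consumes no characters. It only rejects positions where the excluded prefix follows, which is why the second line produces no match at all.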
Unstructured text proximity search with regular expression 22GB
Searching an unstructured file with a regular expression proximity search. In this case, searching for the word
“beautiful” followed by “world” with 0 to 5 words in between.
select * from wikipedia where Results like '-r(beautiful (\w+ ){0,5}world)-w20'
Leading wildcard search and regular expression examples on 65GB
select * from CDR where msisdn like '%%%%%%951354'
select * from CDR where msisdn like '-r((886682|8866271)\d{6})'