Definition

Data gathered in the real world can get spoiled. It may sound perplexing, and it happens rarely, but sometimes, somewhere, something goes wrong and disrupts data accuracy. For example, imagine what happens if…

  • the network connection suddenly drops;
  • the data source generates fake data;
  • the data source produces noisy data in fewer than 3% of cases;
  • etc.

To guard against exactly this kind of corruption, you have to check data integrity on a regular basis. Ideally the check would run immediately after every insert/update operation, but that is a tedious task even for a computing machine: this approach yields something like 99.99% average data quality, yet demands a lot of processing time. On the flip side of the coin, exactly this level of rigour is required for NASA's calculations of an effective orbit for Juno, or for even more precise operations.

A much simpler approach is to run a DB integrity check every minute, day, or week. Combined with an easy backup strategy it is far plainer and much less bounded by CPU load.
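For illustration only, here is a minimal sketch of such a scheduled check, built on SQLite's own pragma integrity_check (shown in detail below); the DB path and the one-minute interval are placeholders, not a recommendation, and a cron job would serve equally well.

#+BEGIN_SRC python
  # A sketch of a periodic integrity check; the path and interval are
  # placeholders. pragma integrity_check returns a single row 'ok' on
  # success, or rows describing the corruption otherwise.
  import sqlite3
  import time

  def check_integrity(db_path):
      conn = sqlite3.connect(db_path)
      try:
          rows = conn.execute('pragma integrity_check').fetchall()
          return [r[0] for r in rows]
      finally:
          conn.close()

  while True:
      print(time.strftime('%Y-%m-%d %H:%M:%S'), check_integrity('scrapy_data.db'))
      time.sleep(60)  # once a minute; a cron job works just as well
#+END_SRC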

Documentation

Data itself has several innate properties, such as value and volume. If you pay attention to the tabular representation of data, the first term you are likely to meet is database normalization, which intertwines the data itself, relations, tables, and databases.

Normalization, however, describes a completely ideal and perfect case, a distant target that you should strive to achieve, or at least attempt to follow, to keep your data in a dense, consistent state.
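To make the term a little more tangible, here is a deliberately tiny, hypothetical sketch (the tables and columns are invented, not from the dataset below): a flat table repeats the brand name in every row, while the normalized pair of tables stores each name once and references it by key.

#+BEGIN_SRC python
  # A hypothetical, minimal illustration of normalization on an
  # in-memory DB; none of these tables exist in the real dataset.
  import sqlite3

  conn = sqlite3.connect(':memory:')
  # denormalized: the brand name is repeated in every row
  conn.execute('CREATE TABLE flat(id INTEGER PRIMARY KEY, brand TEXT, model TEXT)')
  # normalized: the brand is stored once and referenced by key
  conn.execute('CREATE TABLE brand(id INTEGER PRIMARY KEY, name TEXT UNIQUE)')
  conn.execute('CREATE TABLE bike(id INTEGER PRIMARY KEY, '
               'brand_id INTEGER REFERENCES brand(id), model TEXT)')
  conn.close()
#+END_SRC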

Prerequisites

  • SQLite3 database;
  • Emacs with org-mode Babel support for sqlite and Python source blocks.

Check strategy

Overview

We will check unique fields only: uniqueness is what matters here.

DB location

#+BEGIN_SRC sh :exports both
ls /home/vikky/Desktop/DVCS/stuff/scrapy/activesport/
#+END_SRC

#+RESULTS:
| activesport    |
| log1.txt       |
| log.txt        |
| pseudo-log.txt |
| scrapy.cfg     |
| scrapy_data.db |

DB schema

Let's connect to the DB and look at its schema:

#+BEGIN_SRC sqlite :db /home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db :results output :exports both
.schema
#+END_SRC

#+RESULTS:
: CREATE TABLE activesport(id INTEGER PRIMARY KEY, title TEXT);

DB content

#+BEGIN_SRC sqlite :db /home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db :exports both :colnames yes
select * from activesport limit 10;
#+END_SRC

#+RESULTS:
| id | title                                                       |
|----+-------------------------------------------------------------|
|  1 | Wishbone Bike 3 in 1 Recycled Edition Balance Bike          |
|  2 | Wishbone Bike 2 in 1 Recycled Edition Balance Bike          |
|  3 | Nicko Flower Pink Wooden Balance Bike - Running Bike        |
|  4 | Giant HalfWay City 2015                                     |
|  5 | Wishbone Bike 3 in 1 - Original Balance Bike                |
|  6 | CBR Hopper Training / Balance Bike 2015                     |
|  7 | Giant Prime E+ 2 2015                                       |
|  8 | Giant TCR Advanced 1 2015                                   |
|  9 | Raleigh Red or Dead Starstruck Women's Classic Shopper Bike |
| 10 | Raleigh Stowaway 7 Folding Bike                             |

The main tool in the SQLite3 toolbox for checking the integrity of the whole DB is pragma integrity_check:

#+BEGIN_SRC sqlite :db /home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db :results output :exports both
pragma integrity_check;
#+END_SRC

#+RESULTS:
: ok

Count the number of rows in the table:

#+BEGIN_SRC sqlite :db /home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db :results output :exports both
select count(*) from activesport;
select count(title) from activesport;
#+END_SRC

#+RESULTS:
: 981
: 981

Are the titles unique?

To see for yourself, try to insert an already existing title into our DB and re-count the rows.

#+BEGIN_SRC sqlite :db /home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db :results output :exports both
select title from activesport where id=1;
insert into activesport values(((select count(*) from activesport) + 1), (select title from activesport where id=1));
-- re-count titles
select count(title) from activesport;
delete from activesport where id=(select count(*) from activesport);
select count(*) from activesport;
select count(title) from activesport;
#+END_SRC

#+RESULTS:
: "GT Bump 26 Jump Bike 2016"
: 982
: 981
: 981

It seems that select count(*) in SQLite3 provides no uniqueness check over the data: count(title) merely counts non-NULL values, not distinct ones, so the duplicate went unnoticed by both queries.
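In fact SQL can express the uniqueness check directly: count(distinct title) counts unique values, so comparing it with count(title) reveals duplicates. A minimal sketch against the same DB (not part of the original experiment):

#+BEGIN_SRC python
  # COUNT(title) counts non-NULL values, COUNT(DISTINCT title) counts
  # unique ones; a mismatch between the two reveals duplicates.
  import sqlite3

  conn = sqlite3.connect('/home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db')
  total, distinct = conn.execute(
      'SELECT COUNT(title), COUNT(DISTINCT title) FROM activesport').fetchone()
  print('non-NULL titles: %d, distinct titles: %d' % (total, distinct))
  conn.close()
#+END_SRC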

A pure Python remedy does the same without DISTINCT: extract all titles, count them, convert the list into a set, and compare the lengths.

#+BEGIN_SRC python :results output :session stdlib :exports both
  import sqlite3

  conn = sqlite3.connect('/home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db')
  c = conn.cursor()
  c.execute('SELECT COUNT(*) FROM activesport')
  num_rows = c.fetchone()[0]
  c.execute('SELECT title FROM activesport')
  title_set = set(t[0] for t in c.fetchall())
  print("There are %d titles and %d are unique" % (num_rows, len(title_set)))
  conn.close()
#+END_SRC

#+RESULTS:
: There are 981 titles and 981 are unique

Proof

  • Insert the existing title and re-check the uniqueness;
  • Delete the test row and restore the DB to its initial state.

    #+BEGIN_SRC python :results output pp :session stdlib :exports both
      import sqlite3
    
      conn = sqlite3.connect('/home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db')
      c = conn.cursor()
      c.execute('SELECT COUNT(*) FROM activesport')
      num_rows = c.fetchone()[0]
      c.execute('SELECT title from activesport WHERE id=1')
      existing_title = c.fetchone()[0]
      print(existing_title)
      next_row = num_rows + 1
      print(next_row)
      c.execute('INSERT INTO activesport values (?, ?)', (next_row, existing_title))
      conn.commit() # it is necessary after each INSERT/UPDATE operation
      c.execute('SELECT title FROM activesport')
      # collect all titles
      title_set = set(t[0] for t in c.fetchall())
      c.execute('SELECT COUNT(*) FROM activesport')
      num_rows_new = c.fetchone()[0]
      print("There are %d titles and %d are unique" % (num_rows_new, len(title_set)))
      # clean-up
      c.execute('DELETE FROM activesport WHERE id>?', (num_rows,))
      conn.commit()
      conn.close()
    #+END_SRC
    
    #+RESULTS:
    #+begin_example
    Wishbone Bike 3 in 1 Recycled Edition Balance Bike
    982
    There are 982 titles and 981 are unique
    #+end_example
    


Code refactoring

Now that we have unveiled all the subtleties around uniqueness, it is time to utilize the core feature of org-mode sessions and synthesize much nicer code.

Create a function:

#+BEGIN_SRC python :results output :session stdlib :exports both
  # sqlite3 is assumed to be already imported in the current session
  def uniq_field_test(conn, tbl, field):
      c = conn.cursor()
      # note: table and field names cannot be ? parameters,
      # hence the plain string formatting below
      query = 'SELECT COUNT(*) FROM %s' % tbl
      c.execute(query)
      num_rows = c.fetchone()[0]
      query = 'SELECT %s FROM %s' % (field, tbl)
      c.execute(query)
      # a set keeps each value once, so its size counts distinct values
      value_set = set(t[0] for t in c.fetchall())
      return ("There are %d titles and %d are unique" % (num_rows, len(value_set)))

  conn = sqlite3.connect('/home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db')
  print(uniq_field_test(conn, 'activesport', 'title'))
  conn.close()
#+END_SRC

#+RESULTS:
: There are 981 titles and 981 are unique

Now that the function uniq_field_test lives in the org session, the whole proof shrinks to:

   #+BEGIN_SRC python :results output pp :session stdlib :exports both
     conn = sqlite3.connect('/home/vikky/Desktop/DVCS/stuff/scrapy/activesport/scrapy_data.db')
     c = conn.cursor()
     tbl = 'activesport'
     field = 'title'
     query = 'SELECT COUNT(*) FROM %s' % tbl
     c.execute(query)
     num_rows = c.fetchone()[0]
     query = 'SELECT %s FROM %s WHERE id=1' % (field, tbl)
     c.execute(query)
     existing_value = c.fetchone()[0]
     # fetchone() returns a tuple; we take its first element
     print(existing_value)
     print(type(existing_value))
     next_row = num_rows+1
     print(next_row)
     c.execute('INSERT INTO activesport VALUES (?, ?)', (next_row, existing_value))
     # value placeholders (?) work for data only; table and column
     # names cannot be parameterized, so the table name is spelled out
     conn.commit() # commit writes the pending INSERT to the table
     print(uniq_field_test(conn, tbl, field))
     query = 'DELETE FROM %s WHERE id=%d' % (tbl, next_row)
     c.execute(query)
     conn.commit()
     print(uniq_field_test(conn, tbl, field))
     conn.close()
   #+END_SRC

   #+RESULTS:
   #+begin_example
   Wishbone Bike 3 in 1 Recycled Edition Balance Bike
   <class 'str'>
   982
   There are 982 titles and 981 are unique
   There are 981 titles and 981 are unique
   #+end_example

Results analysis

The code samples above confirm that it is possible to insert duplicate data into our table.

Conclusion

This particular database schema allows the user to distort data integrity. There are several innate RDBMS methods to prevent this unwanted distortion, and I will attempt to explore them in the following articles.
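As a small preview of the simplest such method, here is a minimal sketch on a hypothetical in-memory DB (not the scrapy_data.db above): with a UNIQUE constraint on title, the duplicate insert from our experiments fails outright instead of silently slipping in.

#+BEGIN_SRC python
  # A preview on a throwaway in-memory DB: a UNIQUE constraint makes
  # SQLite reject the duplicate title with an IntegrityError.
  import sqlite3

  conn = sqlite3.connect(':memory:')
  conn.execute('CREATE TABLE activesport(id INTEGER PRIMARY KEY, title TEXT UNIQUE)')
  conn.execute("INSERT INTO activesport VALUES (1, 'Giant HalfWay City 2015')")
  try:
      conn.execute("INSERT INTO activesport VALUES (2, 'Giant HalfWay City 2015')")
  except sqlite3.IntegrityError as err:
      print('rejected:', err)  # UNIQUE constraint failed: activesport.title
  conn.close()
#+END_SRC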




Published

30 August 2016

Categories

org-mode Python SQLite3 literate programming
