HOWTO: Building a JSON Database API (3)

Be aware (!), while following along, that there is a chance you might be blocked (seen as a DoS threat) by the Wikipedia sites. There is a Wikipedia API:Etiquette page that shows some of the limitations already in place, but given how I created the API, I won’t hit any request limit as far as I can see…

Ok, the last bits and pieces: creating the JSON “Scraper” package and its body.

Scraping the bits and pieces

As explained, the request will return an aicontinue value for the next picture in line. I also need a time setting/lag so that I am not too aggressive towards the Wikipedia API. Although initially inspired by Gerald Venzl’s posts about alternative code solutions and his GitHub code, I didn’t want a [CTRL-BREAK] kind of solution to stop the processing.
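
To make the aicontinue mechanism concrete, here is a minimal sketch of such a request done from SQL. The URL and parameter values are illustrative (the real base URL and static parameters live in the JSON_SCRAPER_CONFIG table), and an https URL would additionally require Oracle wallet setup:

--
-- Illustrative request only: action=query&list=allimages is the MediaWiki
-- endpoint used here; the JSON response carries a continue.aicontinue value
-- that points at the next picture in line.
SELECT httpuritype(
         'http://en.wikipedia.org/w/api.php'
         || '?action=query&list=allimages&format=json&ailimit=10&aifrom=oracle'
       ).getClob() AS json_response
FROM dual;
--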

The package specification (scraper & sleep)

--
CREATE OR REPLACE PACKAGE JSON_SCRAPER 
AS
  -- --------------------------------------------------------------------------
  -- Version    : 1.0
  -- Created    : Marco Gralike (MG)
  -- Purpose    : JSON Wikimedia Scraper demo
  -- --------------------------------------------------------------------------
  PROCEDURE JSON_WIKI_SCRAPER(
      p_maxruntime     NUMBER DEFAULT 60000 ,
      p_sleep          NUMBER DEFAULT 1000,
      p_page_reference VARCHAR2 DEFAULT 'oracle',
      p_output         BOOLEAN DEFAULT false);
  -- --------------------------------------------------------------------------
  -- Purpose:
  -- Scraping JSON content from the Wikipedia / MediaWiki API
  -- Maxruntime and sleep are in milliseconds. 60000 = 1 min, 1000 = 1 second.
  -- --------------------------------------------------------------------------  
  PROCEDURE SLEEP(
      x_millis IN NUMBER);
  -- --------------------------------------------------------------------------
  -- Purpose:
  -- Alternative SLEEP implementation without the need for SYS grants
  -- x_millis represents milliseconds, so a number like 15000 is equal to 15s.
  -- --------------------------------------------------------------------------
END JSON_SCRAPER;
/
--

The package body has the following content:

--
CREATE OR REPLACE PACKAGE BODY JSON_SCRAPER
AS
  -- ----------------------------------------------------------------
  PROCEDURE JSON_WIKI_SCRAPER(
      p_maxruntime     NUMBER DEFAULT 60000 ,
      p_sleep          NUMBER DEFAULT 1000,
      p_page_reference VARCHAR2 DEFAULT 'oracle',
      p_output         BOOLEAN DEFAULT false)
  AS
    --
    v_semaphore json_scraper_config.started%type;
    v_child_name_escaped scraped_content_continue.child_name_escaped%type;
    v_pkid scraped_content_continue.pkid%type;
    v_count NUMBER DEFAULT 0;
    v_url   VARCHAR2(32767);
    --
  BEGIN
    --
    IF (p_output) THEN
      dbms_output.enable (buffer_size => NULL);
    END IF;
    --
    UPDATE json_scraper_config
    SET started        = 'TRUE'
    WHERE target_table = 'JSON_SCRAPED_CONTENT'
    AND started       <> 'TRUE';
    --
    COMMIT;
    --
    LOOP
      -- -----------------------------------------------------------------
      -- Keep running until column STARTED is updated to a value other than 'TRUE'
      -- -----------------------------------------------------------------
      SELECT started
      INTO v_semaphore
      FROM json_scraper_config;
      --
      EXIT WHEN v_semaphore <> 'TRUE';
      --
      v_count:=v_count+1;
      --
      IF (v_count*p_sleep) >= p_maxruntime THEN
        EXIT;
      END IF;
      --
      IF v_count=1 THEN
        --
        SELECT url_json_api
          ||url_parameter_static
          ||p_page_reference
        INTO v_url
        FROM JSON_SCRAPER_CONFIG;
        --
        INSERT INTO json_scraped_content
          (json
          )
        SELECT httpuritype(v_url).getBlob() AS "JSON" FROM dual;
        --
      ELSE
        --
        SELECT child_name_escaped,
          pkid
        INTO v_child_name_escaped,
          v_pkid
        FROM scraped_content_continue
        ORDER BY pkid DESC
        FETCH FIRST 1 ROWS ONLY;
        --
        IF (p_output) THEN
          dbms_output.put_line('Escaped format['||v_pkid||']: '||v_child_name_escaped);
        END IF;
        --
        SELECT REPLACE(url_json_api
          ||url_parameter_static,url_parameter_dynamic,url_parameter_dynamic
          ||v_child_name_escaped)
        INTO v_url
        FROM JSON_SCRAPER_CONFIG;
        --
        INSERT INTO json_scraped_content
          (json
          )
        SELECT httpuritype(v_url).getBlob() AS "JSON" FROM dual;
        --
      END IF;
      --
      COMMIT;
      SLEEP(p_sleep);
      --
      IF (p_output) THEN
        dbms_output.put_line('Running for '||(v_count*p_sleep/1000)||' seconds...');
        dbms_output.put_line('URL: '||v_url);
      END IF;
      --
    END LOOP;
    --
  END JSON_WIKI_SCRAPER;
  -- ----------------------------------------------------------------
  -- Alternative SLEEP method without need for SYS privileges
  -- ----------------------------------------------------------------
  PROCEDURE SLEEP(
      x_millis IN NUMBER)
  AS
    LANGUAGE JAVA NAME 'java.lang.Thread.sleep(long)';
    -- ----------------------------------------------------------------
END JSON_SCRAPER;
/
--

The advantage of implementing a sleep method this way is that I won’t need additional grants on SYS-owned packages (such as EXECUTE on DBMS_LOCK) in the database.
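
As a quick sanity check that the Java call resolves, something like the following should pause for roughly two seconds:

--
BEGIN
  -- x_millis is in milliseconds, so 2000 is roughly a 2 second pause.
  JSON_SCRAPER.SLEEP(2000);
END;
/
--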

  1. In the first bit of the code I grab the “status” from the JSON_SCRAPER_CONFIG table to see whether the package is already running. If not, I update the STARTED column in the JSON_SCRAPER_CONFIG table to “TRUE”. As long as the package has not exited due to the time-limit restriction, it will run (loop!) continuously until the STARTED column has a value other than “TRUE” (see the stop sketch right after this list).
  2. The first time around, I grab the initial starting values (parameters) from the config table, fetch the first Wikipedia JSON content, insert that JSON content into table JSON_SCRAPED_CONTENT, and then…
  3. …from the second time onwards, continuously grab the content and info needed from the SCRAPED_CONTENT_CONTINUE view, based on the Top-N query that always returns the value for the last “parent” and “child” picture info,
  4. …with, in between, a “sleep” period (default 1 second) before the package loops and gets the next JSON content via the actions described in point 3.
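
Because the loop keys off the STARTED column, a second session can stop a running scrape gracefully. A minimal sketch, using the same column and predicate as the package body:

--
-- Flip the semaphore; the running loop will exit on its next iteration.
UPDATE json_scraper_config
SET started        = 'FALSE'
WHERE target_table = 'JSON_SCRAPED_CONTENT';
--
COMMIT;
--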

And that’s it.

More or less. It’s probably also worth noting that I use the URL_PARAMETER_DYNAMIC value in the JSON_SCRAPER_CONFIG table as the information needed to change the &aifrom={parent picture} into the &aifrom={child picture} (which becomes the parent value the next time around).
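
For illustration, this is what that REPLACE call effectively does. The literal URL fragment and image name below are made up; the real values come from JSON_SCRAPER_CONFIG and the SCRAPED_CONTENT_CONTINUE view:

--
-- REPLACE swaps the bare '&aifrom=' for '&aifrom=<last child name>'.
SELECT REPLACE(
         'http://en.wikipedia.org/w/api.php?action=query&list=allimages&format=json&aifrom=',
         '&aifrom=',
         '&aifrom=' || 'Oracle_headquarters.jpg'
       ) AS next_url
FROM dual;
--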

And really, that’s it. 😉

Running the package now via the following (for 1 hour total, 1 second sleep in between)…

--
DECLARE
  P_MAXRUNTIME NUMBER;
  P_SLEEP NUMBER;
  P_PAGE_REFERENCE VARCHAR2(200);
  P_OUTPUT BOOLEAN;
BEGIN
  P_MAXRUNTIME := 3600000;
  P_SLEEP := 1000;
  P_PAGE_REFERENCE := 'a';
  P_OUTPUT := true;

  JSON_SCRAPER.JSON_WIKI_SCRAPER(
    P_MAXRUNTIME => P_MAXRUNTIME,
    P_SLEEP => P_SLEEP,
    P_PAGE_REFERENCE => P_PAGE_REFERENCE,
    P_OUTPUT => P_OUTPUT
  );
--rollback; 
END;
/
--

…creates 3599 rows with JSON content in my JSON_SCRAPED_CONTENT table, which displays like this when queried directly via SQL Developer (remember, it’s a BLOB column?!)
[Screenshot: JSON_SCRAPED_CONTENT queried directly in SQL Developer, showing the BLOB column]

…but like this via view SCRAPED_JSON_4000_MAX…
[Screenshot: output of the SCRAPED_JSON_4000_MAX view]
Remember the chopping of content bigger than VARCHAR2(n), and replacing it with NULL, in the JSON_TABLE COLUMNS section?

… and like this via my SCRAPED_CONTENT view…
[Screenshot: output of the SCRAPED_CONTENT view]
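
For reference, a minimal sketch of the kind of JSON_TABLE projection such views use. This is not the actual view definition: the JSON path and column size are assumptions based on the MediaWiki allimages response, and FORMAT JSON is needed here because the JSON lives in a BLOB column:

--
SELECT jt.image_name
FROM json_scraped_content j,
     JSON_TABLE(j.json FORMAT JSON, '$.query.allimages[*]'
       COLUMNS (
         -- With NULL ON ERROR, values that do not fit the VARCHAR2
         -- size come back as NULL (the "chopping" mentioned above).
         image_name VARCHAR2(4000) PATH '$.name' NULL ON ERROR
       )
     ) jt
FETCH FIRST 5 ROWS ONLY;
--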

Stop! Security?

One thing that is of course good to know: before you can run the package (which uses HTTPURITYPE, which under the covers uses UTL_HTTP, which uses… anyway), you must create the needed ACLs via DBMS_NETWORK_ACL_ADMIN and grant access. You can do this via the following code (based on the fact that the user/schema is called JSON_WIKI):

--
BEGIN
  -- Drop any existing ACL first; this raises an error if the ACL does
  -- not exist yet, which can safely be ignored on a first run.
  DBMS_NETWORK_ACL_ADMIN.drop_acl ( acl => 'json_wiki.xml');
  COMMIT;
END;
/
--
BEGIN
  -- Create the ACL and grant the 'connect' privilege to JSON_WIKI.
  dbms_network_acl_admin.create_acl ( acl => 'json_wiki.xml'
                                    , description => 'Allow external access'
                                    , principal => 'JSON_WIKI'
                                    , is_grant => TRUE
                                    , privilege => 'connect'
                                    , start_date => SYSTIMESTAMP
                                    , end_date => NULL);
  COMMIT;
END;
/
--
BEGIN
  -- Also grant 'resolve', so host names can be resolved to IP addresses.
  dbms_network_acl_admin.add_privilege ( acl => 'json_wiki.xml'
                                       , principal => 'JSON_WIKI'
                                       , is_grant => TRUE
                                       , privilege => 'resolve' );
  COMMIT;
END;
/
--
BEGIN
  -- Assign the ACL to all hosts ('*'); restrict this in real environments.
  dbms_network_acl_admin.assign_acl( acl => 'json_wiki.xml',
                                     host => '*' );
  COMMIT;
END;
/
--
COMMIT;
--
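
To double-check the result, a DBA can query the network ACL data dictionary view afterwards:

--
SELECT acl, principal, privilege, is_grant
FROM   dba_network_acl_privileges
WHERE  acl LIKE '%json_wiki.xml';
--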

And that’s really it for now.

Readme?!

Here is the needed “scrapper.sql”, or, if you want to have a look right now, the text-extension version:

So don’t forget to follow the initial remarks mentioned in the file:

  1. create a database user JSON_WIKI
  2. grant dba, xdbadmin to JSON_WIKI (a sketch of steps 1 and 2 follows this list)
  3. connect as JSON_WIKI
  4. run the SQL file
  5. execute the DBMS_NETWORK_ACL_ADMIN portion mentioned above
  6. run your JSON scraper…
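
A minimal sketch of steps 1 and 2, run as a privileged user (the password below is a placeholder, pick your own):

--
CREATE USER json_wiki IDENTIFIED BY "choose_a_password";
GRANT dba, xdbadmin TO json_wiki;
--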

That should do it.

Written by: Marco Gralike