Previously, this was the piece that caused the most frustration and lost time. Let’s see where this new decision tree takes us! For this file, we want to select only the author objects that match the book authors in our smaller book demo file. Therefore…
Yes, our scenario involves data. No, we don’t need to operate on the row/field structure.* Yes, we want to operate on the value with conditional logic (if an author matches one in the book file, keep it; otherwise, discard it).
*Note: Though we do want to evaluate based on the field, we are not actually looking to change the overall structure of the JSON object or the file structure itself. That will stay intact.
This leads us to use a database, which gives us three steps to get the author data we need:
1. Import the necessary data (sketched just after this list)
2. Select the subset we want in the resulting file
3. Export the subset of authors
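To make the import step concrete, here is a minimal sketch of what loading the raw ten-thousand-book JSON file could look like when run through cypher-shell with Neo4j's APOC library. The file path, credentials, property names, and the Book label are assumptions for illustration, not the actual demo schema, and apoc.load.json requires the APOC plugin with file import enabled.

```bash
# Sketch only: read each object from the book JSON file and create a Book
# node per entry. Path, credentials, and property names are placeholders.
cypher-shell -u neo4j -p "$NEO4J_PASSWORD" "
  CALL apoc.load.json('file:///books_10k.json') YIELD value
  CREATE (b:Book {book_id: value.book_id, title: value.title});
"
```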
I used the Neo4j graph database as my database choice, although other options would work. Neo4j makes several of these steps very easy, but to avoid learning a new database for the sake of some demo data cleaning, I’d recommend starting with whatever database you’re already comfortable with.
Back to our decision tree. Neo4j has a few tools that actually allowed me to combine the "yes" and "no" sides of that final decision, but I recommend putting all your decision logic in a query language. That is what query languages are designed to do, after all! You can run a filtering query to gather the authors related to the ten thousand books, then tag them somehow (separate table, new collection, different label, etc.). Then you can use a database tool to dump that segment without any additional criteria.
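In Cypher, for example, the "tag" could simply be an extra label set on every author connected to an imported book; the later export then needs no criteria beyond that label. The Author and Book labels and the WROTE relationship below are assumptions about the schema, not the actual demo data.

```bash
# Sketch only: add an extra label to every author connected to a book,
# so a later export can select on that label alone. Labels and the
# relationship type are placeholders for your own schema.
cypher-shell -u neo4j -p "$NEO4J_PASSWORD" "
  MATCH (a:Author)-[:WROTE]->(:Book)
  SET a:KeepAuthor;
"
```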
For my Neo4j case, I imported the ten-thousand-book file into the database, then used a database utility procedure to select those books and import only the authors that matched. This meant the only authors in my database were the ones I needed. Then I used a shell tool to run a query selecting those entities and piped that data to a database utility tool that exported it to an external file. Because I was using a database export tool, I had to export as plain text, which meant there was some formatting cleanup to do. Time to consult the decision tree again!
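A rough sketch of the shape of that query-and-export step is below; the exact commands, label, and file names are illustrative rather than the precise ones I ran.

```bash
# Sketch only: pipe a Cypher query into cypher-shell and capture its
# plain-text output in a file. Credentials, label, and file names are
# placeholders.
echo "MATCH (a:Author) RETURN a;" | \
  cypher-shell -u neo4j -p "$NEO4J_PASSWORD" --format plain > authors_raw.txt
```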
Yes, our situation involves data. Yes, we want to operate on the row/field structure. This is due to escape characters (`\`) before delimiters, so we want to remove those on rows and fields, changing the structure from plain text to standard JSON. This puts us on command line tools, and we are not working with structured data (it is plain text). This means we can use a built-in Linux tool like the `tr` command to remove the escape character.
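For example, assuming the export landed in a file like authors_raw.txt, a single tr invocation can strip the backslashes (the file names are placeholders):

```bash
# Delete backslash escape characters from the plain-text export.
tr -d '\\' < authors_raw.txt > authors_clean.json
```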