Duke deduplication engine : exact same record not matched


I am trying to use Duke to match records from one CSV file against another. Both files have the columns ID, Model, Price, CompanyName, Review and Url, and I want to find the records in the first file that are duplicated in the second.
package no.priv.garshol.duke;

import no.priv.garshol.duke.matchers.PrintMatchListener;

public class RunDuke {
    public static void main(String[] argv) throws Exception {
        Configuration config = ConfigLoader
                .load("/home/kishore/Duke-master/doc/example-data/presonalCare.xml");
        Processor proc = new Processor(config);
        proc.addMatchListener(new PrintMatchListener(true, true, true, false,
                config.getProperties(), true));
        proc.link();
        proc.close();
    }
}
Here is an example of personalCare.xml:
<!-- For more information, see https://github.com/larsga/Duke/wiki/ Improvements
needed: - some area numbers have spaces in them - not stripping accents from
names -->
<duke>
<schema>
<threshold>0.7</threshold>
<property type="id">
<name>ID</name>
</property>
<property>
<name>Model</name>
<comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
<low>0.4</low>
<high>0.8</high>
</property>
<property>
<name>Price</name>
<comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
<low>0.04</low>
<high>0.73</high>
</property>
<property>
<name>CompanyName</name>
<comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
<low>0.4</low>
<high>0.8</high>
</property>
<property>
<name>Review</name>
<comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
<low>0.12</low>
<high>0.93</high>
</property>
<property>
<name>Url</name>
<comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
<low>0.12</low>
<high>0.93</high>
</property>
</schema>
<database class="no.priv.garshol.duke.databases.InMemoryDatabase">
</database>
<group>
<csv>
<param name="input-file" value="personal_care_11.csv" />
<param name="header-line" value="false" />
<column name="1" property="ID" />
<column name="2" property="Model" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="3" property="Price" />
<column name="4" property="CompanyName" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="5" property="Review" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="6" property="Url" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
</csv>
</group>
<group>
<csv>
<param name="input-file" value="personal_care_11.csv" />
<param name="header-line" value="false" />
<column name="1" property="ID" />
<column name="2" property="Model" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="3" property="Price" />
<column name="4" property="CompanyName" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="5" property="Review" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="6" property="Url" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
</csv>
</group>
</duke>
The code above runs without errors, but it does not match a record that is identical in both CSV files, for example:
STHDRNFKAQ4AFYE8,Littmann 3M Classic II S.E Acoustic Stethoscope,6297,Littmann,,http://dl.flipkart.com/dl/littmann-3m-classic-ii-s-e-acoustic-stethoscope/p/itme3uhzbqxhzfda?pid=STHDRNFKAQFAFYE8&affid=3ba0de4902524e2b90e43b84b89ea0ef
I would also like to know what the low and high values in the XML file mean, and how to choose them for each column.

You are doing record linkage (two data sets) and not deduplication (single data set), so take out the .deduplicate() call.
Also, please don't use the 'no.priv.garshol.duke' package name. You should never use domain names you don't own yourself.
Anyway, the reason you can't find any matches is that the two records have the same ID. Duke checks that it is not reporting a record as matching itself, so the match gets filtered out. If you make a copy of the CSV file, change the ID in the copy, and use the copy for group 2, Duke finds the duplicate.
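One way to produce such a copy is a small script that rewrites the ID column. This is only a sketch: the function name, the "copy-" prefix and the output file name are made up for illustration, and it assumes the ID is the first column, as in the config above.

```python
import csv

def copy_with_new_ids(src_path, dst_path, id_column=0, prefix="copy-"):
    """Copy a CSV file, prefixing the ID column so no ID is shared with the source."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # drop blank lines
            row[id_column] = prefix + row[id_column]
            writer.writerow(row)

# Hypothetical usage: point the second <group> at the copy afterwards.
# copy_with_new_ids("personal_care_11.csv", "personal_care_11_copy.csv")
```

Group 2's input-file parameter would then name the copy instead of the original file.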
Here's what happens when I try that:
[lars.garshol@laptop tmp]$ java -cp ~/cvs-co/duke/duke-core/target/duke-core-1.3-SNAPSHOT.jar:. no.priv.garshol.duke.Duke --showmatches presonalCare.xml
MATCH 0.9982630751840313
ID: 'SHDRNFKAQ4AFYE8', Model: 'littmann 3m classic ii s.e acoustic stethoscope', Price: '6297', CompanyName: 'littmann', Url: 'http://dl.flipkart.com/dl/littmann-3m-classic-ii-s-e-acoustic-stethoscope/p/itme3uhzbqxhzfda?pid=sthdrnfkaqfafye8&affid=3ba0de4902524e2b90e43b84b89ea0ef',
ID: 'STHDRNFKAQ4AFYE8', Model: 'littmann 3m classic ii s.e acoustic stethoscope', Price: '6297', CompanyName: 'littmann', Url: 'http://dl.flipkart.com/dl/littmann-3m-classic-ii-s-e-acoustic-stethoscope/p/itme3uhzbqxhzfda?pid=sthdrnfkaqfafye8&affid=3ba0de4902524e2b90e43b84b89ea0ef',
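On the low and high question: as far as I understand Duke's model, high is the probability that two records match when the property values are identical, and low is the probability when the values are completely different; the per-property probabilities are then combined with a naive-Bayes update. The sketch below is not Duke's code (the function names are mine), and it assumes the empty Review field is simply not compared. Under those assumptions, combining the high values of the four properties that are identical in the output above gives a score in line with the reported MATCH value.

```python
from functools import reduce

def combine(p1, p2):
    """Naive-Bayes combination of two match probabilities."""
    return (p1 * p2) / ((p1 * p2) + (1.0 - p1) * (1.0 - p2))

def overall_probability(property_probabilities):
    """Fold all per-property probabilities together, starting from a neutral 0.5."""
    return reduce(combine, property_probabilities, 0.5)

# "high" values from the config for the properties whose values are identical
highs = {"Model": 0.8, "Price": 0.73, "CompanyName": 0.8, "Url": 0.93}
score = overall_probability(highs.values())
print(score)  # approximately 0.99826
```

Read this way, a high close to 1.0 means "identical values are strong evidence of a match", and a low close to 0.0 means "different values are strong evidence against one"; a value of 0.5 is neutral.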

Related

Importing data from CSV using loadDate liquibase. Unknown columns creating an error

Let's say I have a data.csv file with some data.
name, lastName, something, somethingElse
John, Doe , drinking , eating
I'm creating a table
<changeSet id="1" author="me">
<createTable tableName="myTable">
<column name="name" type="varchar(255)"/>
<column name="lastName" type="varchar(255)"/>
<column name="something" type="varchar(255)"/>
<column name="somethingElse" type="varchar(255)"/>
</createTable>
</changeSet>
I want to insert my data from data.csv
<changeSet author="me" id="2" runAlways="true">
<loadUpdateData encoding="UTF-8"
file="db/data.csv"
schemaName="liq"
separator=","
tableName="myTable">
<column name="name" type="varchar(255)"/>
<column name="lastName" type="varchar(255)"/>
<column name="something" type="varchar(255)"/>
<column name="somethingElse" type="varchar(255)"/>
</loadUpdateData>
</changeSet>
This works fine. In my case the separator is ",".
Now I want to handle a different scenario.
I receive a data2.csv file with unfinished (unnamed) columns that were added for future inserts.
It looks like this:
name, lastName, something, somethingElse,,,,
John, Doe , drinking , eating ,,,,
So there are some extra commas separating columns that do not exist (yet).
Is it possible to configure Liquibase to skip, or not read, those extra commas? In the current situation I get
Reason: liquibase.exception.UnexpectedLiquibaseException: Unreferenced unnamed column is not supported
I have hundreds of rows, and all of them end with these commas, so removing them manually is pointless and not allowed in my situation.
To sum up, I want to create a table with only the existing columns and insert the data from the CSV file. The file has extra commas at the end of each row, and unfortunately the comma is also my column separator. I checked the list of properties and didn't find anything useful.
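If no Liquibase setting turns out to cover this, one possible workaround outside Liquibase is to strip the unnamed trailing columns in an automated preprocessing step before the changeset runs. The sketch below is an assumption-laden illustration, not a Liquibase feature: the function name is made up, and it assumes the unnamed columns are always at the end of the header row.

```python
import csv
import io

def strip_trailing_empty_columns(text):
    """Drop columns whose header is empty and that sit at the end of every row."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header = rows[0]
    width = len(header)
    # Walk back from the end of the header while the column name is blank.
    while width > 0 and header[width - 1].strip() == "":
        width -= 1
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(row[:width])  # anything beyond the named columns is discarded
    return out.getvalue()

sample = (
    "name, lastName, something, somethingElse,,,,\n"
    "John, Doe , drinking , eating ,,,,\n"
)
cleaned = strip_trailing_empty_columns(sample)
print(cleaned)
```

The cleaned text can be written to a new file that the loadUpdateData changeset points at, leaving the original data2.csv untouched.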

How to connect to a JDA server from Python

I have a JDA server and its connection details. I need to connect to this server from my Python program and execute MOCA commands, but so far I have not found any documentation on how to do this.
I found some JAR files, but nothing for Python. My Python client application has to connect to JDA and execute commands.
I sent requests and got session-key values back. I also executed commands with the session key, but the results are not being reflected.
I called this command to log in, with the following request body:
<moca-request autocommit="True">
<environment>
<var name="USR_ID" value="super"/>
</environment>
<query>login user where usr_id = 'super' and usr_pswd = 'super'</query>
</moca-request>
The login succeeds, and I get this response:
<?xml version="1.0" encoding="UTF-8"?>
<moca-response>
<session-id></session-id>
<status>0</status>
<moca-results>
<metadata>
<column name="usr_id" type="S" length="0" nullable="true"/>
<column name="locale_id" type="S" length="0" nullable="true"/>
<column name="addon_id" type="S" length="0" nullable="true"/>
<column name="cust_lvl" type="I" length="0" nullable="true"/>
<column name="session_key" type="S" length="0" nullable="true"/>
<column name="pswd_expir" type="I" length="0" nullable="true"/>
<column name="pswd_expir_dte" type="D" length="0" nullable="true"/>
<column name="pswd_disable" type="I" length="0" nullable="true"/>
<column name="pswd_chg_flg" type="O" length="0" nullable="true"/>
<column name="pswd_expir_flg" type="O" length="0" nullable="true"/>
<column name="pswd_warn_flg" type="O" length="0" nullable="true"/>
<column name="srv_typ" type="S" length="0" nullable="true"/>
<column name="super_usr_flg" type="O" length="0" nullable="true"/>
<column name="ext_ath_flg" type="O" length="0" nullable="true"/>
</metadata>
<data>
<row>
<field>SUPER</field>
<field>US_ENGLISH</field>
<field>3pl,WM,SEAMLES,3pl</field>
<field>10</field>
<field>;uid=SUPER|sid=b6698786-85dc-41ec-9e54-c0d8f99b5cbf|dt=jttyorn7|sec=ALL;Hz1biv4HuD_Uq3g.R9QtCfwjQ0</field>
<field null="true"></field>
<field null="true"></field>
<field>6008</field>
<field>0</field>
<field>0</field>
<field>0</field>
<field>DEVELOPMENT</field>
<field>1</field>
<field>0</field>
</row>
</data>
</moca-results>
</moca-response>
I took the session key ;uid=SUPER|sid=b6698786-85dc-41ec-9e54-c0d8f99b5cbf|dt=jttyorn7|sec=ALL;Hz1biv4HuD_Uq3g.R9QtCfwjQ0 from the XML response and tried executing commands with it.
This is how I executed the commands:
<moca-request autocommit="True">
<environment>
<var name="USR_ID" value="super"/>
<var name="SESSION_KEY" value=";uid=SUPER|sid=b6698786-85dc-41ec-9e54-c0d8f99b5cbf|dt=jttyorn7|sec=ALL;Hz1biv4HuD_Uq3g.R9QtCfwjQ0"/>
<var name="LOCALE_ID" value="US_ENGLISH"/>
<var name="MOCA_APPL_ID" value="MYAPP"/>
</environment>
<query>
create record where table = 'alt_prtmst' and prtnum = 'TEST1' and alt_prtnum = 'TEST123' and alt_prt_typ = 'SAP' and prt_client_id = '----' </query>
</moca-request>
The commands execute without any error, and I get this response:
<?xml version="1.0" encoding="UTF-8"?>
<moca-response>
<session-id></session-id>
<status>0</status>
</moca-response>
But the changes are not reflected.
I also tried another MOCA command in the query:
<query>
list warehouses
</query>
Even when a command executes, how do I get its actual output back?

I have interpreted your question as asking how to connect to a JDA (WMS) instance.
I have created an application in Node.js that connects to an instance and executes MOCA commands.
I am posting XML with the request header 'Content-Type': 'application/moca-xml' to <host>:<port>/service. The example XML body below will run the list user tables MOCA command.
<moca-request autocommit="True">
<environment>
<var name="USR_ID" value="..."/>
<var name="SESSION_KEY" value="..."/>
<var name="LOCALE_ID" value="EN-GB"/>
<var name="MOCA_APPL_ID" value="MYAPP"/>
</environment>
<query>list user tables</query>
</moca-request>
The SESSION_KEY can be taken from the response of a login request, XML body below.
<moca-request autocommit="True">
<environment>
<var name="USR_ID" value="..."/>
</environment>
<query>login user where usr_id = '...' and usr_pswd = '...'</query>
</moca-request>
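The same two requests can be sketched in Python with only the standard library. This is a rough illustration based on the request and response shapes shown in this thread, not an official JDA client: the function names are mine, the URL is a placeholder, and the parser only assumes the moca-results layout from the login response above.

```python
import urllib.request
import xml.etree.ElementTree as ET

def build_moca_request(query, env=None):
    """Build the <moca-request> body; env maps variable names to values."""
    root = ET.Element("moca-request", {"autocommit": "True"})
    environment = ET.SubElement(root, "environment")
    for name, value in (env or {}).items():
        ET.SubElement(environment, "var", {"name": name, "value": value})
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")

def parse_moca_response(xml_text):
    """Return (status, rows), where each row is a dict keyed by column name."""
    root = ET.fromstring(xml_text)
    status = int(root.findtext("status", default="-1"))
    columns = [c.get("name") for c in root.findall("moca-results/metadata/column")]
    rows = []
    for row in root.findall("moca-results/data/row"):
        values = [None if f.get("null") == "true" else (f.text or "")
                  for f in row.findall("field")]
        rows.append(dict(zip(columns, values)))
    return status, rows

def post_moca(url, body):
    """POST the XML body with the moca-xml content type and return the raw reply."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/moca-xml"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()

# Hypothetical usage (placeholder host, port and credentials):
# url = "http://<host>:<port>/service"
# login = build_moca_request("login user where usr_id = '...' and usr_pswd = '...'",
#                            {"USR_ID": "..."})
# status, rows = parse_moca_response(post_moca(url, login))
# session_key = rows[0]["session_key"]
```

A response that carries only a status element, like the one returned for the create record command in the question, parses to an empty row list, so the status code is the only thing to check in that case.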

You can use this to connect to a discord server from python. An example follows:
# Note: bot.say() belongs to the pre-1.0 discord.py API used by this example.
import discord
from discord.ext import commands
import random

description = '''An example bot to showcase the discord.ext.commands extension
module.
There are a number of utility commands being showcased here.'''
bot = commands.Bot(command_prefix='?', description=description)

@bot.event
async def on_ready():
    print('Logged in as')
    print(bot.user.name)
    print(bot.user.id)
    print('------')

@bot.command()
async def add(left : int, right : int):
    """Adds two numbers together."""
    await bot.say(left + right)

@bot.command()
async def roll(dice : str):
    """Rolls a dice in NdN format."""
    try:
        rolls, limit = map(int, dice.split('d'))
    except Exception:
        await bot.say('Format has to be in NdN!')
        return
    result = ', '.join(str(random.randint(1, limit)) for r in range(rolls))
    await bot.say(result)

@bot.command(description='For when you wanna settle the score some other way')
async def choose(*choices : str):
    """Chooses between multiple choices."""
    await bot.say(random.choice(choices))

@bot.command()
async def repeat(times : int, content='repeating...'):
    """Repeats a message multiple times."""
    for i in range(times):
        await bot.say(content)

@bot.command()
async def joined(member : discord.Member):
    """Says when a member joined."""
    await bot.say('{0.name} joined in {0.joined_at}'.format(member))

@bot.group(pass_context=True)
async def cool(ctx):
    """Says if a user is cool.
    In reality this just checks if a subcommand is being invoked.
    """
    if ctx.invoked_subcommand is None:
        await bot.say('No, {0.subcommand_passed} is not cool'.format(ctx))

@cool.command(name='bot')
async def _bot():
    """Is the bot cool?"""
    await bot.say('Yes, the bot is cool.')

bot.run('token')
Hope it helps...

The filter mechanism in Hibernate hbm file is not very flexible for dynamic predicates

I am developing a Spring and Hibernate application with a central database for an enterprise web application that has about 1000 users online daily.
You can think of it as a billing application in which any user can do anything on his own account (e.g. increase or decrease the amount of his billing).
Each user has his own data, which is restricted to that user by a filter mechanism in the hbm files:
<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
"http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping default-lazy="true">
<class name="org.myoffice.Inventory" table="Core_INVENTORY">
<id name="id" column="Id" type="java.lang.Long">
<generator class="sequence" >
<param name="sequence">SEQ_INVENTORY</param>
</generator>
</id>
<many-to-one name="bill" column="bill_ID" entity-name="org.myoffice.Bill" not-null="true" unique-key="unq_StrHouse_Smp_Pn_Exp"/>
<property name="expireDate" column="expire_Date" type="date" unique-key="unq_StrHouse_Smp_Pn_Exp"/>
<many-to-one name="user" column="user_id" entity-name="org.myoffice.User" not-null="true" update="false" />
<many-to-one name="createdBy" column="CreatedBy" entity-name="org.myoffice.User" not-null="true" update="false" />
<many-to-one name="updatedBy" column="UpdatedBy" entity-name="org.myoffice.User" not-null="true" />
<property name="createdDate" column="CreatedDate" type="date" not-null="true" update="false" />
<property name="updatedDate" column="UpdatedDate" type="date" not-null="true"/>
<property name="ip" column="IP" type="string" not-null="true"/>
<filter name="powerAuthorize" condition="[SQL QUERY IS HERE FOR RESTRICTION ANY USER TO OWN DATA]"/>
</class>
</hibernate-mapping>
NOTE: The [SQL QUERY IS HERE FOR RESTRICTION ANY USER TO OWN DATA] placeholder in the hbm above ends with a WHERE clause that takes a userId parameter to restrict a user to his own data. This parameter is set in the generic repository method below.
I enable the powerAuthorize filter from the hbm in my generic repository like this:
public void applyDefaultAuthorizeFilter(Session session) {
    Filter filter = session.enableFilter("powerAuthorize");
    filter.setParameter("userId", SecurityUtility.getAuthenticatedUser().getId());
}
This filter is always appended to every query to filter the data.
Everything worked fine until the customer of my application came up with a new requirement that is not compatible with the current design: a user should now be able to increase the billing of another user.
If I skip the hbm filter, any user will see the information of other users, and if I don't skip the filter, I can't implement this new request.
Is there another mechanism, pattern or anything else I could use instead?

The @Filter is useful when the condition does not change and only the bind parameter value varies.
What you need here is to filter the WHERE clause predicate. Therefore, you need to move the filtering logic into your data access layer:
You write DAO methods to filter the Inventory based on user rights.
You remove the @Filter since the DAO methods will do that instead.
This design is much more flexible in the long term too.

You can use several filters and enable or disable them on demand. For example, activate your "powerAuthorize" filter for all requests, and for your new requirement enable another filter, "comprehensiveAuthorize", while temporarily disabling "powerAuthorize".

Exception occurs if the position of discriminator tag in hibernate is moved down

I am new to Hibernate. I am trying to map both my super-class and sub-class to a single table.
<class name="Employee" table="EmpWithManager">
<id name="id" column="ID">
<generator class="native"></generator>
</id>
<discriminator column="EMP_TYPE" type="string"></discriminator>
<property name="firstName" column="FIRST_NAME"></property>
<property name="lastName" column="LAST_NAME"></property>
<property name="salary" column="SALARY"></property>
<subclass name="Manager" extends="Employee">
<property name="managerId" column="MAN_ID"></property>
<property name="noOfEmployees" column="NUMBER_EMP"></property>
</subclass>
</class>
This works fine, but if I change the position of the discriminator tag as follows:
<class name="Employee" table="EmpWithManager">
<id name="id" column="ID">
<generator class="native"></generator>
</id>
<property name="firstName" column="FIRST_NAME"></property>
<discriminator column="EMP_TYPE" type="string"></discriminator>
<property name="lastName" column="LAST_NAME"></property>
<property name="salary" column="SALARY"></property>
<subclass name="Manager" extends="Employee">
<property name="managerId" column="MAN_ID"></property>
<property name="noOfEmployees" column="NUMBER_EMP"></property>
</subclass>
</class>
This re-ordering gives me the below exception:
Caused by: org.xml.sax.SAXParseException: The content of element type "class" must match "(meta*,subselect?,cache?,synchronize*,comment?,tuplizer*,(id|composite-id),discriminator?,natural-id?,(version|timestamp)?,(property|many-to-one|one-to-one|component|dynamic-component|properties|any|map|set|list|bag|idbag|array|primitive-array)*,((join*,subclass*)|joined-subclass*|union-subclass*),loader?,sql-insert?,sql-update?,sql-delete?,filter*,fetch-profile*,resultset*,(query|sql-query)*)".
Can anybody please tell me why this is happening, and whether the discriminator has to come at the beginning?

If you look at the http://hibernate.org/dtd/ entry for hibernate-mapping-3.0.dtd it defines the class element as follows. Order is important as this is a DTD. Note that discriminator? comes after (id|composite-id) and before the long entry with property. This ordering requirement is not explicitly mentioned in the (current) hibernate documentation.
<!ELEMENT class (
meta*,
subselect?,
cache?,
synchronize*,
comment?,
tuplizer*,
(id|composite-id),
discriminator?,
natural-id?,
(version|timestamp)?,
(property|many-to-one|one-to-one|component|dynamic-component|properties|any|map|set|list|bag|idbag|array|primitive-array)*,
((join*,subclass*)|joined-subclass*|union-subclass*),
loader?,sql-insert?,sql-update?,sql-delete?,
filter*,
fetch-profile*,
resultset*,
(query|sql-query)*
)>
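To make the ordering rule concrete, here is a deliberately simplified sketch of the kind of check a validating parser performs. It is not Hibernate's validation: the slot list is copied from the DTD above, cardinality (?, *) and the exact join/subclass alternatives are ignored, and the function name is mine.

```python
import xml.etree.ElementTree as ET

# Slots of the <class> content model in DTD order (cardinality ignored).
CLASS_SLOTS = [
    {"meta"}, {"subselect"}, {"cache"}, {"synchronize"}, {"comment"}, {"tuplizer"},
    {"id", "composite-id"}, {"discriminator"}, {"natural-id"}, {"version", "timestamp"},
    {"property", "many-to-one", "one-to-one", "component", "dynamic-component",
     "properties", "any", "map", "set", "list", "bag", "idbag", "array",
     "primitive-array"},
    {"join"}, {"subclass", "joined-subclass", "union-subclass"},
    {"loader"}, {"sql-insert"}, {"sql-update"}, {"sql-delete"},
    {"filter"}, {"fetch-profile"}, {"resultset"}, {"query", "sql-query"},
]

def first_misplaced_child(class_xml):
    """Return the tag of the first child that comes after a later slot, or None."""
    highest = 0
    for child in ET.fromstring(class_xml):
        slot = next((i for i, names in enumerate(CLASS_SLOTS) if child.tag in names), None)
        if slot is None:
            continue  # unknown element: out of scope for this sketch
        if slot < highest:
            return child.tag
        highest = slot
    return None

good = ('<class name="Employee"><id name="id"/><discriminator column="EMP_TYPE"/>'
        '<property name="firstName"/><subclass name="Manager"/></class>')
bad = ('<class name="Employee"><id name="id"/><property name="firstName"/>'
       '<discriminator column="EMP_TYPE"/><property name="lastName"/></class>')
print(first_misplaced_child(good))
print(first_misplaced_child(bad))
```

For the second mapping the check stops at discriminator, because a property element has already moved the position past the discriminator slot.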

According to the hibernate document type definitions (DTD) listed here, the position of the discriminator tag must be after the id tag. Essentially the structure of the xml document in this situation is pre-defined, and you must follow the pre-defined format, and that is why you see an error after moving the discriminator tag.
From the JBoss docs:
5.1.8 - Discriminator:
The <discriminator> element is required for polymorphic persistence using the table-per-class-hierarchy mapping strategy and declares a discriminator column of the table. The discriminator column contains marker values that tell the persistence layer what subclass to instantiate for a particular row. A restricted set of types may be used: string, character, integer, byte, short, boolean, yes_no, true_false.
I'd imagine that you must define how properties will be discriminated against before you define them and that is the reasoning for the structure within the DTD.

Remove comments in XML android

I'm working in Eclipse and I would like to remove all the comments from my XML file in code.
When there are no comments in my XML file, my application works. But as soon as the file contains any comment, my application stops working.
My question is: in Android Java, is there a way to remove all the comments from my XML file programmatically?
I searched some keywords on Google, but I didn't find anything for Java, only for C# or VB.
In those languages, they use something like "node" (I didn't really understand how it works).
Below is an example XML file with the kind of comment I want to remove.
<?xml version="1.0" encoding="utf-8"?>
<pma_xml_export version="1.0" xmlns:pma="http://www.phpmyadmin.net/some_doc_url/">
<database name="mvc">
<!-- XML commentaries that i want to remove -->
<table name="hrd">
<column name="No_HRD">1</column>
<column name="Nom_HRD">H?tel Fleuritel</column>
<column name="Rue_HRD"> 1 Bd Jean Delautre</column>
<column name="CP_HRD">08000</column>
<column name="Ville_HRD"> Charleville-M?zi?re</column>
<column name="Tel_HRD"> 03 24 37</column>
<column name="Fax_HRD">03 24 37 5</column>
<column name="Email_HRD"> </column>
<column name="Image_HRD">1</column>
<column name="Site_HRD"> http://www.hotel-fleuritel.com</column>
<column name="No_Categorie_HRD">H</column>
<column name="No_Ville_HRD">1</column>
</table>
<!-- XML commentaries that i want to remove -->
<table name="hrd">
<column name="No_HRD">2</column>
<column name="Nom_HRD">Premi?re Classe CHARLEVILLE-MEZIERES</column>
<column name="Rue_HRD">Route de la Francheville ZAC Du MOULIN-LE-BLANC</column>
<column name="CP_HRD">08000</column>
<column name="Ville_HRD"> Charleville-M?zi?re</column>
<column name="Tel_HRD">08 92 70 7</column>
<column name="Fax_HRD"> 03 24 37</column>
<column name="Email_HRD">charlevillemezieres@premiereclasse.fr </column>
<column name="Image_HRD">3</column>
<column name="Site_HRD">http://www.premiere-classe-charleville-mezieres.fr</column>
<column name="No_Categorie_HRD">H</column>
<column name="No_Ville_HRD">1</column>
<!-- XML commentaries that i want to remove -->
</table>
<!-- XML commentaries that i want to remove -->
</database>
</pma_xml_export>

You don't have to remove them; you can just skip them. It is explained on this page of the Android developer site.
If you go to the "read the feed" part, you'll see that the parser simply skips anything it doesn't recognise as a tag, so comments are skipped as well. I implemented this in my own project recently, so I can confirm that it works.
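The same idea can be shown with Python's standard-library pull parser, which is only an analogue of the Android XmlPullParser and not the Android API itself. The sketch assumes the table/column layout from the question; the function name is mine. Because only start and end events are requested, comments never reach the handling code.

```python
import xml.etree.ElementTree as ET

def read_tables(xml_text):
    """Collect each <table> as a dict of column name -> text, ignoring comments."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(xml_text)
    parser.close()
    tables = []
    for event, elem in parser.read_events():
        if event == "start" and elem.tag == "table":
            tables.append({})  # a new record begins
        elif event == "end" and elem.tag == "column" and tables:
            tables[-1][elem.get("name")] = (elem.text or "").strip()
    return tables

sample = """<database name="mvc">
<!-- XML commentaries that i want to remove -->
<table name="hrd">
<column name="No_HRD">1</column>
<!-- XML commentaries that i want to remove -->
<column name="CP_HRD">08000</column>
</table>
<!-- XML commentaries that i want to remove -->
</database>"""
tables = read_tables(sample)
print(tables)
```

The comments stay in the file; the reading code just never acts on them, which is the approach the answer describes.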
