Harvesting Email Addresses from a Website using Java

#1
Github:

[To see links please register here]


In this tutorial we'll create a small Java command-line application that extracts email addresses from websites. A program like this comes in handy for people who work in advertising and similar fields.

So before we jump right into programming, let's think about the possible steps of the program.

Since we are extracting emails from a website, we need to ask the user for the website's URL. Having the URL doesn't magically give us the emails; we have to fetch the contents of the URL first. Once we have the contents, how do we extract the emails? You guessed it: with a regular expression (regex).

So the list of steps is:


[Hidden content]


Now that we have a basic layout of our program, let's start coding part by part, adding possible improvements along the way. But first we will create our EmailExtractor class.


[Hidden content]



Handling URL

We will initialize the EmailExtractor with a URL that the user supplies via command-line arguments; we will cover that part at the end. For now, we will create the EmailExtractor constructor, which takes a URL as an argument and initializes the URL object.

So the code is:


[Hidden content]


If you are new to the Java URL class, it simply allows us to open a connection to the specified URL and read data from it. You must specify the protocol (http/https) in the URL; otherwise the URL constructor throws a MalformedURLException, so we enclose the statement in a try...catch block.


[Hidden content]
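
As a rough illustration of this step, here is a minimal sketch of building the URL object and catching the MalformedURLException. The class name UrlSketch and the method parse are made up for this example and are not taken from the original listing; the error message wording is also an assumption.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlSketch {
    // Hypothetical helper: returns the URL object, or null when the string
    // is not a valid URL (for example, when the protocol is missing).
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs but still available
    public static URL parse(String address) {
        try {
            return new URL(address);
        } catch (MalformedURLException e) {
            // Assumed message wording, only for illustration
            System.out.println("Invalid URL: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("https://example.com")); // valid: protocol is present
        System.out.println(parse("example.com"));         // invalid: no protocol, so null
    }
}
```

Returning null keeps the sketch short; a real constructor might instead rethrow or exit with a message.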


Getting URL Contents

In the previous section we initialized our URL object to hold the user's URL. What we need to do now is read the contents of the URL and store them in a variable so that we can later apply the regex and extract email addresses.

Let's create a method, readContents, which reads the contents from the URL. It uses a BufferedReader to read the InputStream from the URL object and saves the contents in a StringBuilder variable.


[Hidden content]


The url.openStream() call opens a connection to the URL and returns an InputStream so that we can read data from it. The BufferedReader wraps an InputStreamReader and reads the characters in buffered blocks.

The readContents method is complete, but there's a problem: if the URL supplied by the user is correctly formatted but doesn't actually exist on the internet, url.openStream() throws an IOException. We need to handle that exception too, so we surround the whole block with try...catch.


[Hidden content]


Now if the user enters a URL like

[To see links please register here]

which doesn't exist, our program will report the error Unable to read URL due to Unknown Host..
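
A readContents method along these lines could look like the sketch below. It is a minimal version under assumptions: the class name ReadContentsSketch, the UTF-8 charset, and the exact error message are ours, and the method takes the URL as a parameter rather than reading a field.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ReadContentsSketch {
    // Reads everything behind the URL into one String.
    // Returns an empty String when the URL cannot be read.
    public static String readContents(URL url) {
        StringBuilder contents = new StringBuilder();
        // try-with-resources closes the reader (and the stream) automatically
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                contents.append(line).append('\n');
            }
        } catch (IOException e) {
            // Covers unknown hosts, missing pages, and other read failures
            System.out.println("Unable to read URL: " + e);
        }
        return contents.toString();
    }
}
```

Because openStream() works with any protocol the JDK supports, the same method can be tried against a local file: URL before pointing it at a real site.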

Extracting Email Addresses Using REGEX

When we obtain the contents of the URL, they will be in messy HTML form. Using a regular expression pattern for email addresses, we can find the matching strings in the content.

The regular expression used is: \b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b

We will create an extractEmail method that uses the regex to search the contents for email addresses. Each hit could be stored in a String ArrayList, but since the same email may appear more than once, we will use a Set instead to keep the results unique.


[Hidden content]
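
Here is a minimal sketch of such an extractEmail method, using the regular expression quoted above. The class name ExtractEmailSketch is hypothetical, and choosing a LinkedHashSet (which keeps the order in which addresses were found) is our assumption; any Set would give uniqueness.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractEmailSketch {
    // The pattern from the tutorial, with backslashes escaped for a Java string
    private static final Pattern EMAIL = Pattern.compile(
            "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b");

    // Collects every match into a Set so repeated addresses appear only once
    public static Set<String> extractEmail(String contents) {
        Set<String> emailAddresses = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(contents);
        while (matcher.find()) {
            emailAddresses.add(matcher.group());
        }
        return emailAddresses;
    }

    public static void main(String[] args) {
        String html = "<a href=\"mailto:alice@example.com\">alice@example.com</a> or bob@example.org.";
        System.out.println(extractEmail(html)); // prints [alice@example.com, bob@example.org]
    }
}
```

Keep in mind that this pattern is deliberately simple: characters such as _ and + are not in its character classes, so an address like first_last@example.com would only be matched from "last" onward.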


Printing out Email Addresses

To print the email addresses from the emailAddresses set to the command line, we will create a method called printAddresses.


[Hidden content]


The printAddresses method first checks whether the Set is non-empty, i.e. whether any emails were found. If so, it prints all of the email addresses; if the Set's size is zero, i.e. no email addresses were found in the website contents, it prints No emails were extracted!
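
The behaviour described above can be sketched as follows. The class name PrintAddressesSketch and the helper describe are hypothetical; splitting the text building from the printing is our choice, made only so the logic is easy to check.

```java
import java.util.Set;

public class PrintAddressesSketch {
    // Builds the text to print: one address per line,
    // or a fixed message when nothing was found.
    public static String describe(Set<String> emailAddresses) {
        if (emailAddresses.isEmpty()) {
            return "No emails were extracted!";
        }
        return String.join(System.lineSeparator(), emailAddresses);
    }

    // Prints the result of describe() to the command line
    public static void printAddresses(Set<String> emailAddresses) {
        System.out.println(describe(emailAddresses));
    }
}
```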

Saving Email Addresses to a Text File (Extra)

Suppose a site you just scraped contains 1000 email addresses and all of them get printed on your terminal. Scrolling through, copying, and pasting them is time-consuming and annoying. So instead we can create a method called saveAddresses, which saves all the extracted email addresses to a file with a name the user chooses.


[Hidden content]


The File object represents a text file with the name passed as an argument to the method, and the BufferedWriter writes the email addresses to it, each on a new line. If there's a problem writing the file, the try...catch block handles the IOException.
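
A saveAddresses method in this spirit might look like the sketch below. The class name SaveAddressesSketch, the boolean return value, and the error message are assumptions for illustration; the file name is used exactly as given, with no extension added.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Set;

public class SaveAddressesSketch {
    // Writes one address per line to the named file.
    // Returns true on success and false when the file could not be written.
    public static boolean saveAddresses(Set<String> emailAddresses, String fileName) {
        File file = new File(fileName);
        // try-with-resources flushes and closes the writer automatically
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
            for (String address : emailAddresses) {
                writer.write(address);
                writer.newLine();
            }
            return true;
        } catch (IOException e) {
            System.out.println("Unable to write file: " + e.getMessage());
            return false;
        }
    }
}
```

Note that FileWriter overwrites an existing file with the same name, so running the program twice with the same file name keeps only the latest results.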

Command Line Arguments

So we are basically done creating our EmailExtractor class and its essential methods. What we need to do now is handle user input. Time to create our main method:


[Hidden content]

You've written this piece of code thousands of times, yet have you ever wondered what String[] args means?

String[] args is simply an array of Strings that contains the command-line arguments passed by the user.
Initially, the args array is empty:


[Hidden content]


So when you type in your terminal something like:


[Hidden content]


The args array becomes:


[Hidden content]


Since it's a typical Java array, we can access the passed arguments by index. So if I wanted to see the first argument the user passed, I would simply do:


[Hidden content]


And it will print hello.

Another thing to keep in mind is that args is just the name of the array. You can name it anything, like String[] myArguments, but it's recommended that you follow the convention and keep it as String[] args.
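
To make the args array concrete, here is a tiny sketch. The class name ArgsSketch and the helper firstArgument are made up for this example; running it as java ArgsSketch hello world is an assumed command line.

```java
public class ArgsSketch {
    // Returns the first command-line argument, or null when none was passed
    public static String firstArgument(String[] args) {
        return args.length > 0 ? args[0] : null;
    }

    public static void main(String[] args) {
        // With "java ArgsSketch hello world" the array holds {"hello", "world"}
        System.out.println("Number of arguments: " + args.length);
        if (args.length > 0) {
            System.out.println(args[0]); // the first argument
        }
    }
}
```

Checking args.length before indexing matters: reading args[0] from an empty array throws an ArrayIndexOutOfBoundsException.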

Handling Command Line Arguments

For our application here, we will have 2 arguments:


[Hidden content]


The first argument is required and the second is optional, depending on whether you want to save the results to a file. When you run the app with just the URL as the argument, it extracts the email addresses and prints them by default. But if you want to save those email addresses to a text file, you need to add an extra argument followed by another argument, the name of the file:


[Hidden content]


-s is the argument that indicates that the user wants to save the results to a file, and emails is the name of that file. So our main method becomes:


[Hidden content]

In the if condition we check that the user has supplied arguments, in particular the URL. If the URL is not supplied, the program simply tells the user Invalid Arguments Supplied... If the URL is included as an argument and the user has also passed -s along with a file name, we save the email addresses to a file using the saveAddresses method; otherwise the list of email addresses is simply displayed. Now our code becomes:


[Hidden content]
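
The argument handling described above could be sketched like this. It is only an outline under assumptions: the class name MainArgsSketch and the helpers hasUrl and outputFile are hypothetical, and the print statements stand in for the real calls to the extractor, printAddresses, and saveAddresses.

```java
public class MainArgsSketch {
    // The URL is required, so at least one argument must be present
    public static boolean hasUrl(String[] args) {
        return args.length >= 1;
    }

    // Returns the file name when "-s <name>" follows the URL, otherwise null
    public static String outputFile(String[] args) {
        if (args.length == 3 && "-s".equals(args[1])) {
            return args[2];
        }
        return null;
    }

    public static void main(String[] args) {
        if (!hasUrl(args)) {
            System.out.println("Invalid Arguments Supplied...");
            return;
        }
        String fileName = outputFile(args);
        if (fileName != null) {
            // Placeholder for: extract from args[0], then saveAddresses(fileName)
            System.out.println("Would save addresses from " + args[0] + " to " + fileName);
        } else {
            // Placeholder for: extract from args[0], then printAddresses()
            System.out.println("Would print addresses from " + args[0]);
        }
    }
}
```

In this sketch, anything other than exactly URL, -s, and a file name falls back to printing; a stricter version could reject unknown flags instead.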


And our Email Extractor is complete. Build the jar file using NetBeans and run this command in the terminal:


[Hidden content]


It'll produce the following output:

[Image: screenshot-from-2016-08-25-23-26-53.png?w=756]
The complete code:


[Hidden content]


This tutorial was fun. I'll write a separate tutorial about command-line arguments in Java, so if you have any confusion about that topic, you can clear it up. Have fun coding :smile:

Procurity:

[To see links please register here]


Regards,
Ex094

#2
I need to shift my syntax away from PHP and SQL and apply myself here.

I've bookmarked this, and shall delve into it when time permits. A very well documented, elaborated and formatted guide.
Thanks @"Ex094", appreciated.

#3
Nice one, hope to see more.

#4
[Image: thumbs-up-192.png]

Create or replace procedure hugs ( word Varchar2)
As
v_hugs_string varchar2(4000);
Begin
-- Build the reversed string one character at a time, from the last character to the first
For i IN REVERSE 1..LENGTH(word) LOOP
v_hugs_string:=v_hugs_string|| SUBSTR(word,i,1);
End LOOP;
DBMS_OUTPUT.PUT_LINE('Your reverse String '||v_hugs_string);
DBMS_OUTPUT.PUT_LINE('You''re very helpful, thanks for sharing your knowledge');
End hugs;
/

begin
hugs('Blessyou brother');
end;
/

#5
Really cool tutorial! Awesome job. :biggrin:

Also: if you want to work with HTML content in Java,

[To see links please register here]

is a great library for that.

#6
Quote:(09-13-2016, 02:36 PM)Xiledcore Wrote:

[To see links please register here]

Really cool tutorial! Awesome job. :biggrin:

Also: if you want to work with HTML content in Java,

[To see links please register here]

is a great library for that.

I vouch...!

#7
Awesome tutorial. You can monetize easily off of this.
Reply

#8
Let's see where this takes us.