Ok, it's been a long while since I started this series, and it's time for another round. This time we're going to go over your first ARM programs, using just basic instructions. This practice will form the foundations of every program you will ever write for ARM. For the most part, you're only going to be using a couple instructions, and we'll discuss how to optimize them in a later part. For now, let's just go over what they are.

So, as I discussed in an earlier part of this series, ARM, like all RISC processors, uses a load-store architecture. This is a fundamental change to how you write assembly code if you have any experience with x86, Z80, or a similar architecture. It means that all of the instructions that do work can ONLY do work on register arguments. For example, something like the following would be valid for x86:

[Code block hidden in the original post]

which would mean: add the value of EAX to the value of the variable held at the top of the stack, and place the result in EAX.

This is invalid for ARM. Our add instruction is not able to interface with memory. So that brings us to the first fundamental difference.

Loading and Storing
So, since ARM can't interface directly with memory in working instructions, we need a way to get data from memory and put it back. We do that using the LDR and STR instructions. They're pretty simple, and they're among the few ARM instructions that take 2 operands. Remember that ARM normally takes 3 operands: the destination, the source, and op-n.
The LDR instruction (short for "load register") will retrieve a value from memory and place it in the destination register. To accomplish the same task as we did above, we'd do the following:

[Code block hidden in the original post]
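Since the original listing isn't available, here's a minimal sketch of what that load-then-add sequence could look like. It assumes GNU assembler syntax, with R0 standing in for EAX and R1 as a scratch register (the scratch register is an assumption, not necessarily what the original used).

```asm
        @ Sketch only: GNU assembler syntax, 32-bit ARM state.
        @ Assumption: R0 plays the role of EAX, R1 is a free scratch register.
        LDR     R1, [R13]       @ R1 = the variable at the top of the stack
        ADD     R0, R0, R1      @ R0 = R0 + R1
```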

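Ok, that was pretty simple. And yes, it is one more instruction than the x86 variant, but that doesn't mean it's slower. Please keep in mind when writing ARM programs that most instructions are designed to complete in 1 clock cycle, whereas x86 takes anywhere from 1 to 320 clock cycles. For this series we'll count each ARM line/instruction as exactly 1 cycle; that's the ideal case, and real cores spend a few extra cycles on loads, stores, and branches.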
STR - the STR instruction is the reverse of LDR, and it's short for "store register". This is the second half to the load-store architecture. Let's look at the following x86 program:

[Code block hidden in the original post]

What that does is it adds the value of EAX to the variable held at the top of the stack. In x86, this is a pretty short, but potentially confusing (to some) sequence of instructions. It takes around 7 clock cycles to complete.
We can use our LDR and STR instructions now to make the same snippet for ARM:

[Code block hidden in the original post]
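A rough sketch of that load, add, store sequence, under the same assumptions as before (GNU assembler syntax, R0 in the role of EAX, R1 as a scratch register picked for illustration):

```asm
        @ Sketch only: add R0 to the variable at the top of the stack.
        LDR     R1, [R13]       @ R1 = variable at the top of the stack
        ADD     R0, R0, R1      @ R0 = R0 + R1 (the old value of R0 is lost)
        STR     R0, [R13]       @ write the sum back to the same stack slot
```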

This snippet takes 3 clock cycles.

Now, let's talk about another fundamental difference with ARM:

Preserving arithmetic values
Since ARM is a 3-operand system, we can actually preserve the value of R0/EAX when we do our addition. To do this for x86, we would need to do a bunch of crazy stack operations:

[Code block hidden in the original post]

This uses at least 11 clock cycles, and it looks messy as all hell. We need to do this because x86's instructions all act as <op>=, meaning ADD is actually += on the first operand. ARM doesn't work that way; it's destination = source + op-n. With this, we can preserve R0 without having to use memory in the process:

[Code block hidden in the original post]
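To make the 3-operand point concrete, here's a sketch of the same sequence with the destination switched to R1 so R0 survives. As before, the syntax is GNU assembler and the register choices are assumptions for illustration.

```asm
        @ Sketch only: same work as before, but R0 keeps its value.
        LDR     R1, [R13]       @ R1 = variable at the top of the stack
        ADD     R1, R0, R1      @ R1 = R0 + R1, R0 is left untouched
        STR     R1, [R13]       @ write the sum back to the stack slot
        @ R0 still holds its original value and can be reused right away
```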

Now, R0 has been preserved, because instead of R0 = R0 + R1, we now have R1 = R0 + R1, effectively R1 += R0. This is extremely useful if we want to use the value of R0 just a couple lines down: we don't have to store it in a different register or push it to the stack in order to save its value. Our snippet still uses exactly 3 clock cycles. This means that, in contrast to the x86 example, our code will run 8 cycles faster, which at 1 GHz (1 ns per cycle) is 8 ns of saved time. Those nanoseconds will add up when the code runs millions of times. This is why the ARM processor in your phone is able to run very intensive programs very quickly, even on a 1-2 GHz processor, whereas your desktop computer might be a little laggy at 3.5 GHz for medium load applications.

Now, some of you might have been wondering why I was using R13 instead of SP. That's because I wanted to get the thought in your mind that aliases exist for registers. This brings us to another key difference between RISC and x86 (CISC).

Lots of registers
ARM has 15 general purpose registers. This means that you can put whatever you want in them, and your program will still run (for the most part). Our x86 counterpart only has 4 that work freely with most instructions. This means that for x86, you will need to put a lot more data into RAM when you don't need it this second, and you waste all of that time having to push and pull it as you work. Each of these registers is 32 bits long. For the sake of consistency I'm basing this series on 32-bit ARM, since it's more common, and on x86 rather than AMD64. So, that means with x86, you have a total of 128 bits, or 16 bytes, of register storage on the chip. That's 16 bytes of data you are limited to at any given time, without having to wait for memory to get more data. In ARM, since you have 15 general purpose registers, you have 480 bits, or 60 bytes, of data you can work with. That's 275% MORE data! This means less time waiting for memory, and less memory used for your programs. Remember that the stack is a place in memory. Of course, it's not advised for you to use all 15 of these registers, though you can. At any given time, you should only modify 14 of them, and I'll tell you why.
With x86, you have those 4 registers (EAX, EBX, ECX, EDX) that you can play with, and they will work with the majority of the instructions. You can put data into EDI, EBP, and ESI if you want to, but they won't actually work with all of the instructions, making them only useful for temporarily storing something rather than pushing it (though you still need to move it back and forth, which costs CPU time). Here in ARM-land, we don't have that limitation. We can use 15 of our 16 total registers with any instruction we like. We can use the 16th register with at least half of the instructions. However, there is a catch. ARM treats all registers (except R15) as user registers. This means that all of those registers can operate on data, but they might have other purposes. Exactly 3 of the 16 registers have a primary purpose other than storing computations. These are exposed through register aliases. Here's what they are:
R13 => SP (the stack pointer)
R14 => LR (the link register, we'll get to this one shortly)
R15 => PC (the program counter, aka the instruction pointer)
So, like I said. You can put data in all of these, with the exception of R15 (if you put a computation in that, bad things will happen). But, you really don't ever want to overwrite the stack pointer, so you'd want to leave that one alone. This direct access to these registers makes it very easy to do some awesome hacks with your code. For example, you can write your program in such a way that when you make the program counter out of (4-byte) alignment by one byte, the program changes to another valid program that you want. This means that you can write a program that only uses 1/4 of the memory it might normally need! You would just increment R15 by one and jump back to the start each time, and it would run over the same bytes, but in a slightly different order each time.
I do want to note, that R0 - R12 have no other primary purpose. At any time, you can modify these to whatever you feel like. This makes up 52 bytes of truly free storage on the CPU die.

Now, I mentioned something in the last section about the link register, so let's talk about that for a bit:

The link register (R14)
With x86, when you call a subroutine, the CPU pushes the address of the next instruction onto the stack for you. This takes up time, and it also means that you have to have very careful control over your stack or your program will break (and it leaves room for attackers). However, it does provide you with the means to "return" (RET) out of your subroutine, and you'd store your return value in EAX. ARM is a little bit different. The ABI for ARM not only doesn't store this value on the stack (saving a little memory and some time), but actually allows you TWO return values (R0 and R1). The second key bit is that, because ARM has lots of registers, you don't need to push your arguments onto the stack either. You supply them in registers, saving you even more time and memory. The standard calling convention (AAPCS) passes the first 4 arguments in R0-R3; in your own hand-written routines you can agree on more, up to 12.
So, how does ARM know where to return to?
Short answer: it doesn't, the programmer is in charge of that. ARM doesn't actually have a call and return setup, but it uses a series of branches. In their base form, a branch is identical to a JMP in x86. However, there are 2 special forms of this: the branch with exchange (BX) and the branch with link (BL). It's the second one we're interested in at the moment. That's more like x86's CALL. Here's a sample x86 program:

[Code block hidden in the original post]

This program does exactly the same thing as our above examples, but it uses a subroutine (like a function) to write the added value back to memory. Let's take a look at how we write this in ARM:

[Code block hidden in the original post]
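Here's a sketch of how the call-and-return pattern described above might look. The label names (store_value, done) and the use of BX LR for the return are assumptions for illustration; MOV PC, LR would also work on classic ARM.

```asm
        @ Sketch only: BL saves the return address in LR (R14).
        LDR     R1, [R13]       @ R1 = variable at the top of the stack
        ADD     R1, R0, R1      @ R1 = R0 + R1, R0 is preserved
        BL      store_value     @ LR = address of the next instruction, then branch
        B       done            @ skip over the subroutine body after it returns

store_value:                    @ hypothetical subroutine name
        STR     R1, [R13]       @ write the sum back to memory
        BX      LR              @ return: branch to the address held in LR

done:
        @ the rest of the program continues here
```

Nothing gets pushed to the stack for the call itself, which is the point the text is making.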

Now, these programs should look pretty much the same. Ok, so we've now covered all of the basics! Let's get to writing our first programs!

We'll start with the classic x86 linux hello world:

[Code block hidden in the original post]

Note: the code above was copied from another source (the link is hidden in the original post).

Oh boy....every time I see that code I cringe a little. Not because it's terribly different from the ARM variant, but just because it looks so yucky. Let's start writing the ARM version and I'll explain the differences as we go.
First of all, since ARM has had the NX bit for ages (do some research on it), we don't need to differentiate our sections. The OS won't allow us write and execute permission on the same memory. So we already get to skip our section .text.
Also, since ARM has a large register file, we pass everything in as registers.

[Code block hidden in the original post]
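For reference, a minimal hello world for 32-bit ARM Linux (EABI) could look roughly like the sketch below. The label names (msg, len) and the message text are assumptions, and unlike the description above it keeps the string in a .data section, which is just the conservative choice for a standalone example.

```asm
        @ Sketch only: GNU assembler syntax, Linux EABI syscalls.
        .data
msg:    .ascii  "Hello, world!\n"
        .equ    len, . - msg    @ length of the string in bytes

        .text
        .global _start
_start:
        MOV     R7, #4          @ syscall number 4 = sys_write
        MOV     R0, #1          @ first argument: fd 1 (stdout)
        LDR     R1, =msg        @ second argument: pointer to the string
        LDR     R2, =len        @ third argument: number of bytes to write
        SWI     0               @ call the OS

        MOV     R7, #1          @ syscall number 1 = sys_exit
        MOV     R0, #0          @ exit status 0
        SWI     0               @ call the OS, never returns
```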

If you've ever wondered how we know all of these numbers, we don't. This is the published Linux syscall interface (the link to the table is hidden in the original post). We follow the C ABI to interface with it.

[Code block hidden in the original post]

The next thing I want to talk about is how we called the OS. As you'll see, with x86 we call int 0x80, but for ARM we're calling SWI 0. This makes them basically the same, but it wasn't always like that. In fact, using Software Interrupt 0 is a relatively new EABI feature, and you shouldn't count on it to always be that. With other kernels (like XNU), you don't load R7 at all; you would call SWI 4 in that case. ARM makes it very easy to determine the interrupt number that was called. For example:

[Code block hidden in the original post]

This would jump into OS code. Notice how we didn't provide the OS with any info, we just called the interrupt? Well, we can make use of the link register to make sense of it all. This means that you can write your own syscalls if you wanted to, and not interfere with the OS (since Linux now uses SWI 0 for everything). To do this you would have a handler, and your handler would load a register with the value at R14 - 4 (the SWI instruction itself), then mask off the top byte, which holds the condition and opcode bits. That would leave you with the 24-bit interrupt number, and you can go about your business from there.
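The decoding step described above can be sketched in two instructions. This assumes the SWI was issued from ARM (not Thumb) state, and the label swi_handler is a hypothetical name; a real handler would also save and restore the registers it touches.

```asm
        @ Sketch only: recover the number encoded in the SWI instruction.
swi_handler:                            @ hypothetical handler label
        LDR     R0, [LR, #-4]           @ R0 = the SWI instruction that trapped
        BIC     R0, R0, #0xFF000000     @ clear the top 8 bits (condition + opcode)
        @ R0 now holds the 24-bit number, e.g. 4 for "SWI 4"
```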

Ok, everything else in that program was basically the same, and for this stage in the tutorial series it pretty much will always be the same. I want to spend a little bit of time to show you a list of some other basic instructions you'll be using:
Here's a graphic that lists out ALL of the instructions that do data processing
[Image: NOuGJ4X.png]

You may have seen this before in our CYFA tutorial series (if you haven't, go take a look). These are very basic instructions, and I don't think I need to explain them to you, however do note that the mnemonics are different from x86 in some places.

So, you saw the hello world program, now let's write a program (using subroutines) that lets us write "Hello World" to a file. We'll write our very own version of strlen for this, because hard coding the value just isn't right. Let's start with that (strlen).
I'm going to use the same definition of strlen as in section 3 of the manual (man 3 strlen):

[Code block hidden in the original post]

Let's start off with a basic subroutine structure:

[Code block hidden in the original post]

Now, for this we're really only going to need 1 variable, so we'll shove that in R1. For the sake of making the programmer's life easier we'll also preserve that register, and note it in our comments.

[Code block hidden in the original post]

From this point forward, I'm not going to leave in the previous comments, just so it doesn't look so messy. Ok, so at this point we have our basic structure, all we need is our loop. We're going to loop as long as the current value isn't 0 (the NULL terminator). With ARM this is really easy if we use the S-bit (if you've forgotten what that was, reread the first part of this series).

[Code block hidden in the original post]

I aligned the columns so you can see better. Let's start our loop

[Code block hidden in the original post]

Ok, that's a pretty simple ARM subroutine for strlen. One thing to notice here is that we need to push and pop one more register (R2). Now we can do it like it's x86 and just add another PUSH and another POP instruction, but ARM has a trick up its sleeve for cases like this: the push multiple. With ARM, the PUSH instruction doesn't actually exist either; it's just an alias for STMDB using R13 as the base register, with writeback. It just looks better to do it using PUSH. So, let's make that change.

[Code block hidden in the original post]
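Putting the pieces of this section together, a strlen along these lines might look like the sketch below. It follows the register roles described in the text (pointer in via R0, length out via R0, R1 and R2 preserved), but it tests for the terminator with CMP and a conditional ADD rather than the S-bit, simply because that's easier to read; the loop label is an assumption.

```asm
        @ Sketch only.
        @ strlen: R0 = pointer to a NUL-terminated string
        @ returns R0 = length in bytes, not counting the terminator
strlen:
        PUSH    {R1, R2}        @ save both scratch registers in one instruction
        MOV     R1, R0          @ R1 = cursor into the string
        MOV     R0, #0          @ R0 = running length
strlen_loop:
        LDRB    R2, [R1], #1    @ R2 = byte at cursor, then cursor += 1
        CMP     R2, #0          @ is it the terminator?
        ADDNE   R0, R0, #1      @ if not, count the byte
        BNE     strlen_loop     @ and keep going
        POP     {R1, R2}        @ restore the scratch registers
        BX      LR              @ return to the caller
```

The register list in POP is written in the same order as in PUSH.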

Notice the changes there. Also notice how the order isn't reversed. It doesn't have to be: ARM always stores the registers in order of register number (lowest register at the lowest address), so a POP with the same list restores them correctly.

Perfect, our strlen function is done. Now, let's make our open and close file subroutines. We'll start with close, since it's the easiest. Again, start with our base:

[Code block hidden in the original post]

Now, we need to go back to our handy syscall reference and figure out the syscall number and parameters for sys_close.

[Code block hidden in the original post]

Alright, now we know which registers we'll need. Looks like all we need to save is R7, because we don't have to set any arguments and sys_close only returns a status code in R0, which we're going to ignore. So, let's push it, pop it, and not bother with clearing R0.

[Code block hidden in the original post]

Now, the rest of this is pretty simple. We just need to move integer 6 into R7 and call the OS. Since the user supplied us with the fd in R0, we don't need to change it at all.

[Code block hidden in the original post]
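A close wrapper matching that description could be as short as the sketch below (GNU assembler syntax; the label name close is taken from the text, the rest is an illustration).

```asm
        @ Sketch only.
        @ close: R0 = file descriptor to close
        @ note: the kernel leaves its status code in R0
close:
        PUSH    {R7}            @ R7 is about to be overwritten
        MOV     R7, #6          @ syscall number 6 = sys_close
        SWI     0               @ call the OS, the fd is already in R0
        POP     {R7}            @ restore the caller's R7
        BX      LR              @ return to the caller
```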

Now, we need to work out how to open the file. We'll do this in a very basic way. We'll hard code some of the options so that the file always opens in read-write mode, creating the file if it doesn't exist, truncating it if it does, and opening with permissions 666 (read write all). This is roughly the same as opening with fopen with the mode string being "w+b" ("wb" would be the write-only equivalent). Let's first figure out our syscall number and argument list, then figure out our hardcoded values.

[Code block hidden in the original post]

Constants (from fcntl.h):

[Code block hidden in the original post]

Ok, so to get the flags, we just bitwise OR all 3 of those. Rather than doing that in code, we'll just do it by hand (since it's hard coded).

[Code block hidden in the original post]
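If you'd rather let the assembler do the OR for you, the sketch below shows one way. It assumes the three flags are O_RDWR, O_CREAT and O_TRUNC (inferred from the description of read-write, create, truncate), with the values used by Linux's generic fcntl.h; the symbol names OPEN_FLAGS and OPEN_MODE are made up for this example.

```asm
        @ Sketch only: assembler-time constants, no instructions are emitted.
        .equ    O_RDWR,     0x0002      @ octal 02
        .equ    O_CREAT,    0x0040      @ octal 0100
        .equ    O_TRUNC,    0x0200      @ octal 01000
        .equ    OPEN_FLAGS, O_RDWR | O_CREAT | O_TRUNC  @ works out to 0x242
        .equ    OPEN_MODE,  0666        @ leading 0 means octal: rw-rw-rw-
```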

Alright, let's get started with our barebones subroutine

[Code block hidden in the original post]

Now, I do want to make a point here, we don't want to run it like this. We need to actually call everything first, since we're relying on sys_open to handle our errors for us.
Now, let's figure out what registers we need. We know we're going to need R7 for the syscall, and then we need 2 registers for our hard coded arguments. So, we need to preserve R1, R2, R7 and note that we destroy R0 (which we already did)

[Code block hidden in the original post]

Alright, now all we need to do is mov our parameters in, and call the OS

[Code block hidden in the original post]
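Here's a sketch of an open wrapper along those lines. One thing to be aware of: neither 0x242 nor 0x1B6 (octal 0666) fits ARM's 8-bit rotated immediate format, so this sketch loads them with the LDR = pseudo-instruction instead of MOV. The flag value assumes O_RDWR | O_CREAT | O_TRUNC with the generic Linux constants.

```asm
        @ Sketch only.
        @ open: R0 = pointer to a NUL-terminated file name
        @ returns R0 = file descriptor, or a negative error code
open:
        PUSH    {R1, R2, R7}    @ preserve everything we overwrite
        LDR     R1, =0x242      @ flags: O_RDWR | O_CREAT | O_TRUNC (assumed)
        LDR     R2, =0x1B6      @ mode: 0666 in octal (read write all)
        MOV     R7, #5          @ syscall number 5 = sys_open
        SWI     0               @ call the OS, result comes back in R0
        POP     {R1, R2, R7}    @ restore the caller's registers
        BX      LR              @ return to the caller
```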

Perfect! At this point, we have everything we need, aside from our main subroutine. Your code should look like this right now

[Code block hidden in the original post]

I like to keep my labels actually on a code line, it makes the code look a little less like spaghetti to me. I also removed all of the "preserves all registers" lines, because the ABI is that we need to preserve them anyways.

Now, let's quickly pseudocode our main subroutine. It will be a pretty simple one now that we have all these nice wrappers for us.

[Code block hidden in the original post]

For now, let's pretend we have already defined the symbols string and file. We can start with the easiest part of it, step 5.

[Code block hidden in the original post]

Now, let's work on #1. We know that it takes an argument (R0) that is a pointer to our file string, and returns a number (R0) that is the fd it opened. This is as simple as moving the arg in and branching to it (with a LINK!)

[Code block hidden in the original post]

Great, but now we have a different issue, when we go to get the length of the string, we'll end up overwriting our fd with that value. No worries, this is where we get to take advantage of ARM's massive register file. From now on, let's store the fd in R1. For the sake of making this fewer steps, we're going to store the length of the string in R2 (this way we don't have to move it). We'll then need to save R1, R2, and R7 (for the sys_write syscall).

[Code block hidden in the original post]

Great, now let's work on step 2. This one is nearly identical to step 1, so I won't comment it. We're just moving string to R0, branching (with link!) to strlen, then moving it to R2.

[Code block hidden in the original post]

Now, on to step 3. This one is slightly different, because we're calling the syscall directly. Remember our sys_write stuff from hello world?

[Code block hidden in the original post]

So, for this we just need to
R7 <= 4
R0 <= fd
R1 <= string
R2 <= string_len
Isn't it great that we already did the last step in that list?

[Code block hidden in the original post]

And now for our last step (that we still have to write), we need to close our file. This takes the fd in on R0 and doesn't return anything, so it's simple

[Code block hidden in the original post]

Sweet! Our finished start subroutine should look like the following:

[Code block hidden in the original post]
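As a rough sketch of the five steps, the start routine could look like this. It assumes the open, strlen and close subroutines and the string and file symbols described in the text are defined elsewhere in the same file. Note one deliberate difference: it parks the fd in R4 (an assumption of this sketch) rather than R1, because sys_write needs R1 for the string pointer and overwrites R0 with its return value.

```asm
        @ Sketch only: relies on open, strlen, close, string and file
        @ being defined elsewhere in the same source file.
        .global _start
_start:
        LDR     R0, =file       @ step 1: open the file
        BL      open            @ R0 = fd
        MOV     R4, R0          @ keep the fd somewhere safe

        LDR     R0, =string     @ step 2: measure the string
        BL      strlen          @ R0 = length
        MOV     R2, R0          @ R2 = length, ready for sys_write

        MOV     R7, #4          @ step 3: sys_write(fd, string, length)
        MOV     R0, R4          @ R0 = fd
        LDR     R1, =string     @ R1 = pointer to the string
        SWI     0               @ call the OS

        MOV     R0, R4          @ step 4: close the file
        BL      close

        MOV     R7, #1          @ step 5: sys_exit(0)
        MOV     R0, #0
        SWI     0
```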

That means that our code is completely finished! All we need to do now is add the little bits of linker fluff around it. We know already that we need to have

[Code block hidden in the original post]

above our start subroutine. Let's go ahead and do the data section

[Code block hidden in the original post]
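A data section for this program might look like the sketch below. The symbol names string and file come from the text; the file name hello.txt and the exact message are placeholders. Using .asciz matters here, because both strlen and sys_open expect a NUL-terminated string.

```asm
        @ Sketch only: .asciz appends the terminating NUL byte for us.
        .data
file:   .asciz  "hello.txt"         @ hypothetical output file name
string: .asciz  "Hello World\n"     @ the text written to the file
```

On an ARM Linux machine with GNU binutils, something like `as -o hello.o hello.s` followed by `ld -o hello hello.o` should build it (the file names are placeholders).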

And we're done! Our finished ARM assembly program should look like the following (I've removed the step comments as well)

[Code block hidden in the original post]

A program that prints hello world to a file, in pure ARM assembly, using only 51 lines!
Now, I do want to note, this is NOT the most optimized form of this program, there are a couple ways to shave 2-7 lines out of this, but I wanted to keep it as simple as possible for you guys for now. I'll hand out rep or NSP to anyone who does make it more efficient.

10 NSP and +2 rep to the first person who can tell me exactly how long this program will take to run on a 1 GHz CPU! (You can treat the software interrupts like a NOP for this.)

I hope you enjoyed this one, it took me like 4 hours to type all of this up for you. @"Ender" I know you wanted to read this one, so here you go. In part 3, we'll talk about how to plan out these programs so they aren't as inefficient as this one is, and in part 4 we'll do some hard core optimizations of our code and really show those x86 idiots that ARM is king!

Well that was uhhm... long... Nice though, thanks for writing this
I expected it to end at "Hello World", but nope, you dove into file I/O.
I'll get that NSP later today (I hope), and I'll also read part 3. Time to break out a Raspberry Pi, hell, maybe I'll even write an ARM kernel to learn more about this.
I find it interesting how ARM chose a 3-argument format, you generally see only 2.
I'm considering learning a few assemblies for the hell of it, MIPS, 6502, Z80, whatever else. Z80 would be an easy one though, I already know 8086...

Quote:(03-18-2018, 02:08 AM)Ender Wrote:
Well that was uhhm... long... Nice though, thanks for writing this
I expected it to end at "Hello World", but nope, you dove into file I/O.
I'll get that NSP later today (I hope), and I'll also read part 3. Time to break out a Raspberry Pi, hell, maybe I'll even write an ARM kernel to learn more about this.
I find it interesting how ARM chose a 3-argument format, you generally see only 2.
I'm considering learning a few assemblies for the hell of it, MIPS, 6502, Z80, whatever else. Z80 would be an easy one though, I already know 8086...

They actually did some crafty stuff so that using 3-arg instructions and 32-bit fixed width wouldn't prevent you from doing the same things that you can with variable width.
[Image: wOZGRtu.png]

The example here being the very wise use of the barrel shifter, which iirc only ARM has. It allows you to effectively do

[Code block hidden in the original post]

all in one instruction, in one step:

[Code block hidden in the original post]
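As an illustration of the barrel shifter point (the exact example from the post isn't available, so the registers and shift amount here are assumptions), compare a separate shift and add with the combined form:

```asm
        @ Sketch only: two instructions, shift first and then add.
        MOV     R3, R2, LSL #2      @ R3 = R2 * 4
        ADD     R0, R1, R3          @ R0 = R1 + R3

        @ The same result in one instruction: the second source operand
        @ passes through the barrel shifter on its way into the ALU.
        ADD     R0, R1, R2, LSL #2  @ R0 = R1 + (R2 << 2)
```

This is handy for things like indexing an array of 4-byte words, where R1 is the base address and R2 the element index.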

thank you for sharing this with us.

This is a fantastic written guide, thank you very much fren! I'm sure it'll come in handy whenever I feel like catching up on ARM-ASM.
